Categorical Data

R Packages

  • rcistats
  • tidyverse
  • ggthemes

Heart Disease

  • Heart Disease

  • Categorical Data

  • Contingency Tables

  • Bar Plots

  • Cross-Tabulation

  • Pie Charts

  • Theming

Heart Disease Data

The heart_disease data set provides heart disease information on patients from Cleveland, Ohio. The data was originally published in the American Journal of Cardiology.

An image of a graph and a heart.

Data

heart_disease |> 
  DT::datatable(options = list(dom = "t",
                pageLength = 4))

Variables of Interest

  • cp: Type of Chest Pain
  • disease: Indicates whether the patient has heart disease

Categorical Data

Categorical data are recorded values that represent a category.

The data may be recorded as “character” or “string” data.

The data may also be recorded as whole numbers, with an attached codebook indicating which category each number represents.
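As a minimal sketch of the codebook idea, the snippet below turns whole-number codes into labeled categories with base R's factor(). The codes and the codebook mapping are made up for illustration; they are not taken from heart_disease.

```r
# Hypothetical raw recordings: whole numbers standing in for categories
cp_code <- c(1, 3, 2, 1, 4, 1)

# Hypothetical codebook: the label that each number stands for
codebook <- c("Asymptomatic", "Atypical Angina",
              "Non-anginal Pain", "Typical Angina")

# factor() attaches the labels, so R treats the values as categories
cp_label <- factor(cp_code, levels = 1:4, labels = codebook)

print(cp_label)
print(table(cp_label))
```

Storing the variable as a factor keeps R from treating the codes as quantities, so summaries report counts per category rather than averages of the codes.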

Examples of Categorical Data

  • Are you a student?

  • What city do you live in?

  • What is your major?

Likert Scale

Likert scales are the rating scales you may have seen in surveys.

  1. Strongly Disagree
  2. Disagree
  3. Neutral
  4. Agree
  5. Strongly Agree

Likert Scales

Likert scales may be treated as numerical data if the steps between adjacent scale points are assumed to be equal.
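A short sketch of both treatments, using made-up survey responses: an ordered factor keeps the answers categorical, while converting to integers treats the scale as numeric and is only reasonable under the equal-steps assumption.

```r
# Hypothetical survey responses on a 5-point Likert scale
responses <- c("Agree", "Neutral", "Strongly Agree", "Agree", "Disagree")

scale_points <- c("Strongly Disagree", "Disagree", "Neutral",
                  "Agree", "Strongly Agree")

# An ordered factor keeps the responses categorical but remembers their order
likert <- factor(responses, levels = scale_points, ordered = TRUE)

# Treating the scale as numeric assumes equal steps between points (1 to 5)
likert_score <- as.integer(likert)

print(table(likert))
print(mean(likert_score))
```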

Summarizing Categorical Data

Once we have the data, how do we summarize it for other people?

Contingency Tables

Contingency tables display how often each category is seen in the data.

Two types of statistics are reported in a table: the frequency and the proportion.

Frequency

Frequency is the number of times a specific category is observed in your sample.

#> [1] Asymptomatic     Asymptomatic     Non-anginal Pain Atypical Angina 
#> [5] Asymptomatic     Atypical Angina  Non-anginal Pain Atypical Angina 
#> Levels: Asymptomatic Non-anginal Pain Atypical Angina Typical Angina

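Proportions (Relative Frequency)

A proportion is the fraction of the sample that falls in a category: the category’s frequency divided by the sample size.

Because proportions do not depend on the sample size, they can be compared across samples of different sizes and used as estimates of the corresponding population proportions.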
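Both statistics can be computed by hand in base R. The sketch below does not use cat_stats(); it rebuilds the chest pain variable from the counts reported for cp in the cat_stats() example in this document, then tabulates it with table() and prop.table().

```r
# Rebuild the chest pain variable from the reported cp counts
cp <- rep(c("Asymptomatic", "Atypical Angina",
            "Non-anginal Pain", "Typical Angina"),
          times = c(142, 49, 83, 23))

freq <- table(cp)          # frequency: count per category
prop <- prop.table(freq)   # proportion: count divided by sample size

print(freq)
print(round(prop, 4))
```

Each proportion is a frequency divided by the sample size, so the proportions always sum to 1.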

Contingency Tables in R

cat_stats(DATA$VAR)
  • DATA: Name of the data frame (eg: heart_disease)
  • VAR: Name of the variable to summarize (eg: cp)

Example

The variable cp indicates the type of chest pain.

cat_stats(heart_disease$cp)
#> Continguency Table 
#>  
#>                    n   prop
#> Asymptomatic     142 0.4781
#> Atypical Angina   49 0.1650
#> Non-anginal Pain  83 0.2795
#> Typical Angina    23 0.0774
#> 
#> Number of Missing: 0
#> Proportion of Missing: 0
#> Row Variable: heart_disease$cp

Bar Plots

Plotting in R

Plotting in R can be done with ggplot2, a powerful package based on the Grammar of Graphics.

Plotting in R

  1. Create a base plot using ggplot()
  2. Use + to add components to the base plot
  3. Indicate how to transform the base plot into the desired plot
    1. geom_*
    2. stat_*
  4. Change the look of the plot with other functions
  5. Use a theme_* function to add a theme to the plot
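The steps above can be sketched as one chained expression. The data frame toy and its values are made up so that the example runs without heart_disease; the labels passed to labs() are likewise only illustrative.

```r
library(ggplot2)

# Small made-up data frame so the sketch runs without heart_disease
toy <- data.frame(
  cp = c("Asymptomatic", "Typical Angina", "Asymptomatic",
         "Atypical Angina", "Asymptomatic", "Atypical Angina")
)

p <- ggplot(toy, aes(x = cp)) +   # step 1: base plot
  geom_bar() +                    # steps 2-3: + adds a geom_* layer
  labs(x = "Chest pain type",     # step 4: other functions adjust the look
       y = "Frequency") +
  theme_bw()                      # step 5: theme_* sets the overall theme

print(p)
```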

Bar Plots

Bar plots can be used to display the frequencies or proportions of the categories in the data.

Frequency Bar Plots in R

ggplot(data = DATA, aes(x = VAR)) +
  geom_bar()
  • DATA: Name of the data frame (eg: heart_disease)
  • VAR: Name of the variable to create a plot (eg: cp)

Frequency Bar Plots in R

ggplot(heart_disease, aes(cp)) +
  geom_bar() 

Relative Frequency Bar Plots in R

ggplot(data = DATA, aes(x = VAR, y = after_stat(prop), group = 1)) +
  geom_bar()
  • DATA: Name of the data frame (eg: heart_disease)
  • VAR: Name of the variable to create a plot (eg: cp)
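The group = 1 argument matters here. A small sketch with made-up data shows why: by default each bar forms its own group, so each bar's proportion is computed relative to itself. layer_data() is the ggplot2 helper that returns the values computed for a layer.

```r
library(ggplot2)

toy <- data.frame(cp = c("A", "A", "A", "B", "B", "C"))  # made-up values

# Without group = 1, each bar is its own group, so every prop equals 1
p_by_bar <- ggplot(toy, aes(x = cp, y = after_stat(prop))) +
  geom_bar()

# With group = 1, prop is computed across all bars and sums to 1
p_overall <- ggplot(toy, aes(x = cp, y = after_stat(prop), group = 1)) +
  geom_bar()

# Compare the proportions ggplot2 computed for each version
print(layer_data(p_by_bar)$prop)
print(layer_data(p_overall)$prop)
```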

Relative Frequency Bar Plots in R

ggplot(heart_disease, aes(cp, after_stat(prop), group = 1)) +
  geom_bar() 

Cross-Tabulation

Data

The variable disease indicates if a patient has heart disease.

cat_stats(heart_disease$disease)
#> Continguency Table 
#>  
#>       n   prop
#> no  160 0.5387
#> yes 137 0.4613
#> 
#> Number of Missing: 0
#> Proportion of Missing: 0
#> Row Variable: heart_disease$disease

Cross-Tabulation

Cross-tabulations, also known as contingency tables, are statistical tools used to analyze the relationship between two or more categorical variables by displaying their frequency distribution in a table format. Each cell in the table represents the count or frequency of observations that fall into a particular combination of categories for the variables.

Key Features of Cross-Tabulations

  1. Rows and Columns:
    • Rows represent the categories of one variable.
    • Columns represent the categories of another variable.
  2. Cells:
    • Each cell displays the frequency or count of data points that belong to the intersection of a row and column category.

Cross-Tabulations in R

cat_stats(DATA$VAR1, DATA$VAR2)
  • DATA: Name of the data frame (eg: heart_disease)
  • VAR1: Name of the first variable to create the cross-tab (eg: cp)
  • VAR2: Name of the second variable to create the cross-tab (eg: disease)

Cross-Tabs Example

cat_stats(heart_disease$cp, heart_disease$disease)
#> Continguency Table 
#>  
#> Column Variable: heart_disease$disease
#> Row Variable: heart_disease$cp
#> $frequency
#>                   
#>                     no yes
#>   Asymptomatic      39 103
#>   Non-anginal Pain  65  18
#>   Atypical Angina   40   9
#>   Typical Angina    16   7
#> 
#> $table_prop
#>                   
#>                        no    yes
#>   Asymptomatic     0.1313 0.3468
#>   Non-anginal Pain 0.2189 0.0606
#>   Atypical Angina  0.1347 0.0303
#>   Typical Angina   0.0539 0.0236
#> 
#> $row_prop
#>                   
#>                        no    yes
#>   Asymptomatic     0.2746 0.7254
#>   Non-anginal Pain 0.7831 0.2169
#>   Atypical Angina  0.8163 0.1837
#>   Typical Angina   0.6957 0.3043
#> 
#> $col_prop
#>                   
#>                        no    yes
#>   Asymptomatic     0.2438 0.7518
#>   Non-anginal Pain 0.4062 0.1314
#>   Atypical Angina  0.2500 0.0657
#>   Typical Angina   0.1000 0.0511

Types of Proportions in Cross-Tabs

  1. Row Proportions: Show the percentage of each row total represented by a cell.
  2. Column Proportions: Show the percentage of each column total represented by a cell.
  3. Table Proportions: Show the percentage of the overall total represented by a cell.
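The three kinds of proportions differ only in the total each cell is divided by. As a sketch that uses base R's prop.table() rather than cat_stats(), the frequencies below are typed in from the cp by disease cross-tab shown earlier.

```r
# Frequencies copied from the cp-by-disease cross-tab
counts <- matrix(
  c(39, 103,
    65,  18,
    40,   9,
    16,   7),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    cp = c("Asymptomatic", "Non-anginal Pain",
           "Atypical Angina", "Typical Angina"),
    disease = c("no", "yes")
  )
)
tab <- as.table(counts)

table_prop <- prop.table(tab)              # cell / grand total
row_prop   <- prop.table(tab, margin = 1)  # cell / row total
col_prop   <- prop.table(tab, margin = 2)  # cell / column total

print(round(table_prop, 4))
print(round(row_prop, 4))
print(round(col_prop, 4))
```

Row proportions sum to 1 across each row, column proportions sum to 1 down each column, and table proportions sum to 1 over the whole table.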

Table Proportions

Table proportions in cross-tabulations refer to the relative frequency or percentage of counts within the entire table, calculated by dividing each cell’s count by the total sum of all counts in the table. These proportions allow you to examine the contribution of each cell to the overall data set.

Table Proportions

cat_stats(heart_disease$cp, heart_disease$disease, prop = "table")
#> Continguency Table 
#>  
#> Column Variable: heart_disease$disease
#> Row Variable: heart_disease$cp
#> $frequency
#>                   
#>                     no yes
#>   Asymptomatic      39 103
#>   Non-anginal Pain  65  18
#>   Atypical Angina   40   9
#>   Typical Angina    16   7
#> 
#> $table_prop
#>                   
#>                        no    yes
#>   Asymptomatic     0.1313 0.3468
#>   Non-anginal Pain 0.2189 0.0606
#>   Atypical Angina  0.1347 0.0303
#>   Typical Angina   0.0539 0.0236

Row Proportions

Row proportions are the relative frequencies of counts within each row of a contingency table: each cell’s count is divided by its row total, so the proportions in a row sum to 1. They let you compare how the column variable is distributed within each category of the row variable.

Row Proportions

cat_stats(heart_disease$cp, heart_disease$disease, prop = "row")
#> Continguency Table 
#>  
#> Column Variable: heart_disease$disease
#> Row Variable: heart_disease$cp
#> $frequency
#>                   
#>                     no yes
#>   Asymptomatic      39 103
#>   Non-anginal Pain  65  18
#>   Atypical Angina   40   9
#>   Typical Angina    16   7
#> 
#> $row_prop
#>                   
#>                        no    yes
#>   Asymptomatic     0.2746 0.7254
#>   Non-anginal Pain 0.7831 0.2169
#>   Atypical Angina  0.8163 0.1837
#>   Typical Angina   0.6957 0.3043

Column Proportions

Column proportions are the relative frequencies of counts within each column of a contingency table: each cell’s count is divided by its column total, so the proportions in a column sum to 1. They let you compare how the row variable is distributed within each category of the column variable.

Column Proportions

cat_stats(heart_disease$cp, heart_disease$disease, prop = "col")
#> Continguency Table 
#>  
#> Column Variable: heart_disease$disease
#> Row Variable: heart_disease$cp
#> $frequency
#>                   
#>                     no yes
#>   Asymptomatic      39 103
#>   Non-anginal Pain  65  18
#>   Atypical Angina   40   9
#>   Typical Angina    16   7
#> 
#> $col_prop
#>                   
#>                        no    yes
#>   Asymptomatic     0.2438 0.7518
#>   Non-anginal Pain 0.4062 0.1314
#>   Atypical Angina  0.2500 0.0657
#>   Typical Angina   0.1000 0.0511

Stacked Bar Plot in R

ggplot(DATA, aes(x = VAR1, y = after_stat(count), fill = VAR2)) +
  geom_bar()

OR

ggplot(DATA, aes(y = VAR1, x = after_stat(count), fill = VAR2)) +
  geom_bar()
  • DATA: Name of the data frame (eg: heart_disease)
  • VAR1: Name of the first variable to create the cross-tab (eg: cp)
  • VAR2: Name of the second variable to create the cross-tab (eg: disease)

Stacked Bar Plot in R

ggplot(heart_disease, aes(x = cp, y = after_stat(count), fill = disease)) +
  geom_bar() 

Stacked Bar Plot in R

ggplot(heart_disease, aes(y = cp, x = after_stat(count), fill = disease)) +
  geom_bar() 

Pie Charts

A pie chart is a circular statistical graphic divided into slices, where each slice represents a proportion or percentage of the whole. The size of each slice is proportional to the relative frequency or magnitude of the category it represents.

Key Features of Pie Charts

  1. Circular Format:
    • The chart is shaped like a circle, symbolizing a whole (100% or 1).
  2. Slices:
    • Each slice corresponds to a category and its size represents the contribution of that category to the total.
  3. Labels:
    • Slices are often labeled with the category name and the percentage or value they represent.

Pie Chart in R

ggplot(DATA, aes(fill = VAR)) +
  geom_pie()
  • DATA: Name of the data frame (eg: heart_disease)
  • VAR: Name of the variable to create a plot (eg: cp)
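geom_pie() does not appear to be part of ggplot2 itself and is presumably supplied by one of the packages loaded for these slides. Where it is unavailable, a common ggplot2-only substitute (a different technique from the call above) is a single stacked bar wrapped into a circle with coord_polar(). A minimal sketch with made-up data:

```r
library(ggplot2)

toy <- data.frame(cp = c("A", "A", "A", "B", "B", "C"))  # made-up values

# One stacked bar (x = "") whose segments are filled by category,
# then wrapped around a circle so each segment becomes a slice
p_pie <- ggplot(toy, aes(x = "", fill = cp)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  theme_void()   # drop axes, which carry no meaning in a pie chart

print(p_pie)
```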

Pie Chart in R

ggplot(heart_disease, aes(fill = slope)) +
  geom_pie() +
  theme( # Used to change legend
    legend.title = element_text(size = 36), # Increase title font
    legend.text = element_text(size = 30) # Increase text font
  )

Theming

Themes

The R package ggthemes allows you to change the overall look of a plot.

All you need to do is add the theme to the plot.

Installing Themes in R

Install once on your computer, or once per new session in Google Colab:

rcistats::install_themes()

Then load the library:

library(ggthemes)

Black and White Theme

ggplot(heart_disease, aes(y = cp, x = after_stat(count), fill = disease)) +
  geom_bar() +
  theme_bw()

A bar chart with a black and white theme.

Excel Theme

ggplot(heart_disease, aes(y = cp, x = after_stat(count), fill = disease)) +
  geom_bar() +
  theme_excel()

A bar chart with an Excel theme.

WSJ Theme

ggplot(heart_disease, aes(y = cp, x = after_stat(count), fill = disease)) +
  geom_bar() +
  theme_wsj()

A bar chart with a Wall Street Journal theme.

Stata Theme

ggplot(heart_disease, aes(y = cp, x = after_stat(count), fill = disease)) +
  geom_bar() +
  theme_stata()

A bar chart with a Stata theme.