In October 2023, James Hoffmann and Cometeer held the “Great American Coffee Taste Test” on YouTube, asking viewers to taste coffee ordered from Cometeer and fill out a survey.
Data
The data is part of the DSLC TidyTuesday program, which provides data sets to help data science learners practice creating graphics.
Information on the data set's variables (columns) can be found here.
Data
# Packages used throughout the slides
library(csucistats)
library(ggtricks)
library(waffle)
library(ggmosaic)
library(tidyverse)
library(ThemePark)
library(DT)

# Read the survey data and preview 10 random rows
coffee <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-05-14/coffee_survey.csv")
datatable(slice_sample(coffee, n = 10), options = list(dom = "p", pageLength = 5))
Categorical Data
The Great American Coffee Taste Test
Categorical Data
Contingency Tables
Bar Plots
Cross-Tabulation
Other Plots
Theming
Categorical Data
Categorical data are recorded values that represent a category.
Data may be recorded as “character” or “string” data.
Data may be recorded as a whole number, with an attached code book indicating the category each number represents.
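As a minimal sketch of the code book idea, the base R snippet below converts numeric codes into labelled categories with factor(). The vector student_code and the code book (1 = "Yes", 2 = "No") are made up for illustration and are not part of the coffee survey.

```r
# Hypothetical survey answers stored as whole numbers.
student_code <- c(1, 2, 2, 1, 1)

# Hypothetical code book: 1 = "Yes", 2 = "No".
student <- factor(student_code, levels = c(1, 2), labels = c("Yes", "No"))

print(student)

# Count how many participants fall in each category.
print(table(student))
```

Storing the labels in a factor keeps the categories readable in tables and plots while the original codes stay available in student_code.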
Examples of Categorical Data
Are you a student?
What city do you live in?
What is your major?
Likert Scale
Likert scales are the rating systems you may have answered in surveys.
Strongly Disagree
Disagree
Neutral
Agree
Strongly Agree
Likert Scales
Likert scales may be treated as numerical data if the jumps between adjacent responses are equal.
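One possible way to move between the categorical and numerical views of a Likert scale in base R is sketched below. The answers vector is hypothetical, and converting to the scores 1 to 5 assumes the jumps between responses really are equal.

```r
# The five response options listed above, in order.
likert_levels <- c("Strongly Disagree", "Disagree", "Neutral",
                   "Agree", "Strongly Agree")

# Hypothetical answers from six participants.
answers <- c("Agree", "Neutral", "Strongly Agree", "Agree",
             "Disagree", "Agree")

# Store the answers as an ordered factor so R knows the ranking.
likert <- factor(answers, levels = likert_levels, ordered = TRUE)

# Treat the scale as numeric (1 to 5); this assumes equal jumps between options.
likert_score <- as.integer(likert)
print(likert_score)
print(mean(likert_score))
```

If the equal-jump assumption is doubtful, the ordered factor can still be summarized with frequencies and proportions like any other categorical variable.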
Summarizing Categorical Data
Once we have the data, how do we summarize it for other people?
Contingency Tables
Contingency tables display how often each category is seen in the data.
Two types of statistics are reported in a table: the frequency and the proportion.
Frequency
Frequency is the number of times a specific category is observed in your sample.
#> [1] "1" "1" "2" "More than 4" "2" "2" "2" "1" "More than 4" "2"
Proportions (relative frequency)
Proportions represent the fraction of the sample that falls in a category: the category's frequency divided by the sample size.
Because proportions do not depend on the sample size, they can be compared across samples and used to describe the population.
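As a small worked example, the ten responses printed above can be summarized with base R's table() and prop.table(). The vector name responses is an assumption made for this sketch; the values are typed in by hand rather than pulled from the survey file.

```r
# The ten responses printed above, entered by hand for illustration.
responses <- c("1", "1", "2", "More than 4", "2",
               "2", "2", "1", "More than 4", "2")

# Frequency: how many times each category appears.
freq <- table(responses)
print(freq)

# Proportion: each count divided by the sample size.
prop <- prop.table(freq)
print(prop)
```

Here "2" appears 5 times out of 10, so its frequency is 5 and its proportion is 0.5.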
Contingency Tables in R
The variable caffeine indicates how much caffeine a participant prefers.
ggplot(data = DATA, aes(x = VARIABLE, y = after_stat(prop), group = 1)) +
  geom_bar()
Relative Frequency Bar Plots in R
ggplot(coffee, aes(caffeine, after_stat(prop), group = 1)) +
  geom_bar()
Cross-Tabulation
Data
The variable taste indicates whether a participant likes the taste of coffee.
cat_stats(coffee$taste)
Cross-Tabulation
Cross-tabulations, also known as contingency tables, are statistical tools used to analyze the relationship between two or more categorical variables by displaying their frequency distribution in a table format. Each cell in the table represents the count or frequency of observations that fall into a particular combination of categories for the variables.
Key Features of Cross-Tabulations
Rows and Columns:
Rows represent the categories of one variable.
Columns represent the categories of another variable.
Cells:
Each cell displays the frequency or count of data points that belong to the intersection of a row and column category.
Margins:
Row and column totals provide summaries for each variable.
The grand total shows the overall sample size.
Types of Proportions in Cross-Tabulations
Row Proportions: Show the percentage of each row total represented by a cell.
Column Proportions: Show the percentage of each column total represented by a cell.
Table Proportions: Show the percentage of the overall total represented by a cell.
Table Proportions
Table proportions in cross-tabulations refer to the relative frequency or percentage of counts within the entire table, calculated by dividing each cell’s count by the total sum of all counts in the table. These proportions allow you to examine the contribution of each cell to the overall data set.
Row Proportions
Row proportions are the relative frequencies within each row of a contingency table, calculated by dividing each cell's count by its row total. They let you compare how the distribution of the column variable changes across the categories of the row variable.
Column Proportions
Column proportions are the relative frequencies within each column of a contingency table, calculated by dividing each cell's count by its column total. They let you compare how the distribution of the row variable changes across the categories of the column variable.
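The three kinds of proportions can be sketched with base R's table() and prop.table(). The ten answers below are hypothetical and only borrow the variable names caffeine and taste; they are not the survey data.

```r
# Hypothetical answers from ten participants (not the survey data).
caffeine <- c(rep("Full", 6), rep("Decaf", 4))
taste    <- c("Yes", "Yes", "Yes", "Yes", "No", "No", "Yes", "No", "No", "No")

# Cross-tabulation: rows are caffeine categories, columns are taste categories.
tab <- table(caffeine, taste)

# Counts with row totals, column totals, and the grand total.
print(addmargins(tab))

table_prop <- prop.table(tab)              # each cell / grand total
row_prop   <- prop.table(tab, margin = 1)  # each cell / its row total
col_prop   <- prop.table(tab, margin = 2)  # each cell / its column total

print(round(table_prop, 2))
print(round(row_prop, 2))
print(round(col_prop, 2))
```

In this made-up sample, 4 of the 10 participants are in the Full/Yes cell (table proportion 0.4), 4 of the 6 Full participants answered Yes (row proportion about 0.67), and 4 of the 5 Yes answers came from Full participants (column proportion 0.8).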
ggplot(DATA, aes(x = VAR1, y = after_stat(count), fill = VAR2)) +
  geom_bar()
Stacked Bar Plot in R
ggplot(coffee, aes(x = caffeine, y = after_stat(count), fill = taste)) +
  geom_bar()
Stacked Bar Plot in R
ggplot(coffee, aes(y = caffeine, x = after_stat(count), fill = taste)) +
  geom_bar()
Other Plots
Pie Charts
A pie chart is a circular statistical graphic divided into slices, where each slice represents a proportion or percentage of the whole. The size of each slice is proportional to the relative frequency or magnitude of the category it represents.
Key Features of Pie Charts
Circular Format:
The chart is shaped like a circle, symbolizing a whole (100% or 1).
Slices:
Each slice corresponds to a category and its size represents the contribution of that category to the total.
Labels:
Slices are often labeled with the category name and the percentage or value they represent.
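To make the "size of each slice" concrete, the arithmetic behind a pie chart can be sketched in base R. The counts for categories A, B, and C are hypothetical.

```r
# Hypothetical counts for three categories.
counts <- c(A = 30, B = 50, C = 20)

# Each slice's share of the whole circle.
share <- counts / sum(counts)

# Angle of each slice in degrees; the angles add up to 360.
angle <- share * 360
print(angle)
```

A category holding half of the observations, like B here, takes up half of the circle.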
Pie Chart in R
df_pie <- cat_stats(coffee$caffeine, tbl_df = TRUE)$table
ggplot(df_pie, aes(cat = Category, val = n, fill = Category)) +
  geom_pie()
Pie Chart in R
coffee_pie <- cat_stats(coffee$caffeine, tbl_df = TRUE)$table
ggplot(coffee_pie, aes(cat = Category, val = n, fill = Category)) +
  geom_pie()
Mosaic Plots
A mosaic plot is a graphical representation of two or more categorical variables. It uses rectangles to visualize the proportions of data categories while simultaneously showing the relationships between the variables. The size of each rectangle corresponds to the relative frequency or proportion of the data in a particular category combination.
Key Features of Mosaic Plots
Rectangular Tiles:
The plot is divided into rectangles, with each tile representing a unique combination of categories.
Proportional Areas:
The area of each rectangle is proportional to the frequency or proportion of the data it represents.
Hierarchical Arrangement:
Variables are arranged hierarchically along the axes, with the tiles subdivided to represent relationships between variables.
Color Coding (Optional):
Different colors can be used to highlight specific patterns, emphasize groups, or indicate significance.
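The link between tile area and proportion can be checked by hand. The base R sketch below follows one common way of constructing a two-variable mosaic plot (split by the first variable, then subdivide by the second); the data are hypothetical, and no claim is made about how any particular plotting package lays out its tiles.

```r
# Hypothetical answers from ten participants (not the survey data).
caffeine <- c(rep("Full", 6), rep("Decaf", 4))
taste    <- c("Yes", "Yes", "Yes", "Yes", "No", "No", "Yes", "No", "No", "No")
tab <- table(caffeine, taste)

# Width of each caffeine strip: its share of the whole sample.
width <- prop.table(rowSums(tab))

# Height of each tile inside a strip: the share of taste within that caffeine group.
height <- prop.table(tab, margin = 1)

# Tile area = width * height, which equals the cell's share of the whole table.
area <- sweep(height, 1, width, "*")
print(round(area, 2))
```

The areas match the table proportions from prop.table(tab), which is why larger tiles correspond to more common category combinations.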
Mosaic Plots in R
ggplot(DATA) +
  geom_mosaic(aes(x = product(VAR1, VAR2), fill = VAR2))
Mosaic Example
ggplot(coffee) +
  geom_mosaic(aes(x = product(caffeine, taste), fill = taste))
Waffle Charts
A waffle chart is a grid-based visualization used to display proportions or percentages in a dataset. It represents parts of a whole by dividing a grid into small squares, where each square corresponds to a specific percentage or unit of the total.
Key Features of Waffle Charts
Grid Structure:
Composed of small squares arranged in rows and columns, typically forming a 10x10 grid (100 squares for 100%).
Proportional Representation:
Each square represents an equal portion of the total, such as 1% in a 10x10 grid.
Color Coding:
Different colors are used to represent different categories or groups.
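A rough sketch of the idea in R: convert proportions into a number of squares out of 100, then lay them out on a 10x10 grid. The proportions for categories A, B, and C are hypothetical. The final step assumes the waffle package loaded earlier is installed and is skipped otherwise.

```r
# Hypothetical proportions for three categories.
prop <- c(A = 0.30, B = 0.50, C = 0.20)

# Number of squares per category in a 10x10 grid (1 square = 1%).
squares <- round(prop * 100)
print(squares)

# Lay the squares out column by column in a 10x10 grid of category labels.
grid <- matrix(rep(names(squares), times = squares), nrow = 10, ncol = 10)
print(grid)

# Draw the chart only if the waffle package is installed.
if (requireNamespace("waffle", quietly = TRUE)) {
  print(waffle::waffle(squares, rows = 10))
}
```

With other data, rounded percentages may not add up to exactly 100, so the counts may need a small adjustment before filling the grid.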