#> [1] 181 186 195 193 190 181 195
Palmer Penguins
Describing numerical data
Summary Statistics
Numerical Statistics in R
Data Visualization
Skewness
Scatter Plots
The penguins data set contains information on penguins from the Palmer Station located in Antarctica. You can learn more about the data set here.

flipper_len: Flipper length in millimeters
body_mass: Body mass in grams
Summary statistics are used to describe the distribution of data.
Central tendency is a statistical concept that refers to the central or typical value around which a set of data points tends to cluster. It is used to summarize and describe a data set by identifying a single representative value that provides insights into the data’s overall characteristics.
Variation in statistics refers to the extent to which data points in a dataset deviate or differ from a central tendency measure. Understanding variation is crucial for making informed decisions, drawing meaningful conclusions, and assessing the reliability of statistical analyses.
The minimum (min) is the smallest value in the data.
The maximum (max) is the largest value in the data.
Quartiles are three values (Q1, Q2, Q3) that divide the sorted data into four subsets of roughly equal size.

Q1 is the value signifying that a quarter of the data (25%) is lower than it.
Q2, the median, is the value signifying that half of the data (50%) is below it.
The median also represents the central tendency of the data.
Q3 is the value signifying that three quarters (75%) of the data is below it.
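The interquartile range (IQR) is the distance between the third and first quartiles:
\[ IQR = Q_3 - Q_1 \]
The range (R) is the distance between the maximum and the minimum:
\[ R = \mathrm{max} - \mathrm{min} \]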
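As a quick check of these definitions, the sketch below computes the quartiles, IQR, and range in R for the seven values printed at the top of these notes, treated here as flipper lengths in millimeters (an assumption). Note that `quantile()` uses R's default interpolation rule (type 7); other quartile conventions can give slightly different values on small samples.

```r
# Seven values copied from the output at the top of the notes,
# assumed here to be flipper lengths in millimeters
flipper_len <- c(181, 186, 195, 193, 190, 181, 195)

# Minimum and maximum
lo <- min(flipper_len)
hi <- max(flipper_len)

# Quartiles Q1, Q2 (median), Q3 using R's default rule (type 7)
q <- quantile(flipper_len, probs = c(0.25, 0.50, 0.75))

# Interquartile range: Q3 - Q1
iqr_manual <- unname(q[3] - q[1])

# Range: max - min
r <- hi - lo

print(q)
print(c(IQR = iqr_manual, range = r))
```

The built-in `IQR()` function returns the same interquartile range as the manual subtraction.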
Describe how you will find the mean of these numbers:
#> [1] 18 18 12 18 13
The mean is another measure of central tendency.
\[ \bar X = \frac{1}{n}\sum^n_{i=1}X_i \]
\(n\): total number of data points
\(X_i\): the \(i\)-th data point
\(i\): index of the data points
\(\sum\): add all terms from the first index (bottom) to the last index (top)
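To make the formula concrete, here is a minimal R sketch that applies it step by step to the five numbers from the exercise above and compares the result with the built-in `mean()`.

```r
# The five numbers from the exercise
x <- c(18, 18, 12, 18, 13)

n <- length(x)        # total number of data points
x_bar <- sum(x) / n   # add all X_i, then divide by n

print(x_bar)          # 15.8
print(mean(x))        # same value from the built-in function
```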
The variance measures the average squared distance of the data points from the mean.
\[ s^2 = \frac{1}{n-1}\sum^n_{i=1}(X_i-\bar X)^2 \]
The standard deviation is the square root of the variance. It measures the typical distance of the data points from the mean, in the same units as the data.
\[ s=\sqrt{s^2} \]
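The two formulas can be traced by hand in R. The sketch below reuses the five numbers from the mean exercise and checks the manual results against the built-in `var()` and `sd()`, which also divide by \(n-1\).

```r
# The five numbers from the mean exercise
x <- c(18, 18, 12, 18, 13)

n <- length(x)
x_bar <- mean(x)

# Sum of squared deviations from the mean, divided by n - 1
s2 <- sum((x - x_bar)^2) / (n - 1)

# Standard deviation: square root of the variance
s <- sqrt(s2)

print(s2)                   # 9.2
print(s)
print(c(var(x), sd(x)))     # built-in equivalents
```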
\[ \bar X = \frac{1}{n}\sum^n_{i=1}X_i \]
\[ P(X \leq \tilde{X}) = 0.5 \]
Mean (blue line) vs Median (red line)
Outliers are data points that seem to be highly distant from all other data points.

R has several built-in functions to compute statistics.
DATA: Name of the data frame (e.g., penguins)
VAR: Name of the variable to summarize (e.g., flipper_len)
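The lecture's own code is not reproduced in these notes, so the following is only a sketch of the usual `FUNCTION(DATA$VAR)` pattern in base R. The data frame `penguins_small` is a hypothetical stand-in that holds just the seven values printed at the top of these notes, assumed to be flipper lengths.

```r
# Hypothetical stand-in for the penguins data frame, holding only the
# seven values printed at the top of the notes
penguins_small <- data.frame(
  flipper_len = c(181, 186, 195, 193, 190, 181, 195)
)

# General pattern: FUNCTION(DATA$VAR)
m   <- mean(penguins_small$flipper_len)
med <- median(penguins_small$flipper_len)
s   <- sd(penguins_small$flipper_len)
v   <- var(penguins_small$flipper_len)

print(c(mean = m, median = med, sd = s, var = v))

# Minimum, quartiles, mean, and maximum in one call
print(summary(penguins_small$flipper_len))
```

If a variable contains missing values, these functions return `NA` unless `na.rm = TRUE` is supplied, for example `mean(DATA$VAR, na.rm = TRUE)`.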
A histogram is a graphical representation of the distribution or frequency of data points in a dataset. It provides a visual way to understand the shape, central tendency, and spread of a dataset by dividing the data into intervals or bins and showing how many data points fall into each bin as a bar.
To change bins:
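The plotting code from the lecture is not shown in these notes, so here is a minimal sketch using base R's `hist()` (an assumption; the lecture may use a different plotting package). `VAL` plays the role of the bin-width placeholder used in these notes, and the bin edges 180 to 196 are chosen by hand to cover the seven example values.

```r
# Seven example values, assumed to be flipper lengths in millimeters
flipper_len <- c(181, 186, 195, 193, 190, 181, 195)

# Histogram with the bins R chooses by default
hist(flipper_len, main = "Default bins", xlab = "Flipper length (mm)")

# Histogram with a chosen bin width: build the bin edges explicitly
VAL <- 4
edges <- seq(180, 196, by = VAL)
h <- hist(flipper_len, breaks = edges,
          main = "Bin width of 4", xlab = "Flipper length (mm)")

# Number of data points that fall into each bin
print(h$counts)
```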
DATA: Name of the data frame (e.g., penguins)
VAR: Name of the variable to create a plot (e.g., flipper_len)
VAL: Numerical value to change the bin width.

A box plot, also known as a box-and-whisker plot, is a graphical representation of the distribution and key statistical characteristics of a dataset. It provides a visual summary of the data’s central tendency, spread, and potential outliers.
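A box plot can be sketched with base R's `boxplot()` (an assumption; the lecture's own plotting code is not shown here). The function also returns the numbers it draws, which makes it easy to see what the box and whiskers represent.

```r
# Seven example values, assumed to be flipper lengths in millimeters
flipper_len <- c(181, 186, 195, 193, 190, 181, 195)

# Draw the box plot and keep the statistics it is based on
b <- boxplot(flipper_len, horizontal = TRUE, xlab = "Flipper length (mm)")

# Rows: lower whisker, lower hinge, median, upper hinge, upper whisker
print(b$stats)

# Points drawn individually as potential outliers (none in this sample)
print(b$out)
```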
DATA: Name of the data frame (e.g., penguins)
VAR: Name of the variable to create a plot (e.g., flipper_len)

Dot plots are similar to histograms, but they stack dots to show how many data points fall within each bin.
To change the bin width:
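Base R has no dedicated binned dot plot, so the sketch below approximates one: each value is first assigned to the left edge of its bin, and `stripchart()` then stacks one dot per data point. This is an illustrative substitute, not the lecture's code, and `VAL` is again the bin-width placeholder.

```r
# Seven example values, assumed to be flipper lengths in millimeters
flipper_len <- c(181, 186, 195, 193, 190, 181, 195)

# Assign each value to the left edge of its bin of width VAL
VAL <- 4
binned <- floor(flipper_len / VAL) * VAL

# One dot per data point, stacked when several fall in the same bin
stripchart(binned, method = "stack", pch = 16,
           xlab = "Flipper length (mm), binned")

# Dots per bin
print(table(binned))
```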
DATA: Name of the data frame (e.g., penguins)
VAR: Name of the variable to create a plot (e.g., flipper_len)
VAL: Numerical value to change the bin width.

A density plot is a way to visualize the distribution of a continuous variable. The height of the curve shows where data values are concentrated (dense) and where they are sparse.
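A density curve can be estimated and drawn with base R's `density()` and `plot()` (an assumption about tooling; the lecture's code is not shown here). The sketch also approximates the area under the curve, which should be close to 1.

```r
# Seven example values, assumed to be flipper lengths in millimeters
flipper_len <- c(181, 186, 195, 193, 190, 181, 195)

# Kernel density estimate with R's default bandwidth
d <- density(flipper_len)

# Taller parts of the curve mark where the data are concentrated
plot(d, main = "Density of flipper length", xlab = "Flipper length (mm)")

# Approximate area under the curve on the evenly spaced grid
area <- sum(d$y) * diff(d$x[1:2])
print(area)
```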
DATA: Name of the data frame (e.g., penguins)
VAR: Name of the variable to create a plot (e.g., flipper_len)
XVAL: Number to place the vertical line (e.g., 5)
YVAL: Number to place the horizontal line (e.g., 2)
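Reference lines can be added to an existing base R plot with `abline()`. In this sketch, which assumes base graphics rather than the lecture's own code, `XVAL` is set to the sample mean and `YVAL` to a count of 2 purely for illustration.

```r
# Seven example values, assumed to be flipper lengths in millimeters
flipper_len <- c(181, 186, 195, 193, 190, 181, 195)

XVAL <- mean(flipper_len)  # position of the vertical line
YVAL <- 2                  # position of the horizontal line

hist(flipper_len, main = "Histogram with reference lines",
     xlab = "Flipper length (mm)")

# Vertical line at x = XVAL
abline(v = XVAL, col = "blue", lwd = 2)

# Horizontal dashed line at y = YVAL
abline(h = YVAL, col = "red", lty = 2)
```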
Skewness is a statistical measure of whether unimodal data follow a symmetric distribution or are skewed to the left or to the right.
A symmetric distribution will look bell-shaped, and the mean (red line) and median (dashed blue line) will overlap each other.
A right-skewed distribution looks asymmetric, and the mean (red line) is to the right of the median (dashed blue line).
A left-skewed distribution looks asymmetric, and the mean (red line) is to the left of the median (dashed blue line).
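The relationship between the mean and the median can be checked numerically. The three small vectors below are made up for illustration only: one symmetric, one with a long right tail, and its mirror image with a long left tail.

```r
# Made-up illustrative samples (not from the penguins data)
symmetric    <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)
right_skewed <- c(1, 2, 2, 3, 3, 3, 4, 10, 20)
left_skewed  <- -right_skewed

# Helper that reports the mean and median side by side
compare <- function(x) c(mean = mean(x), median = median(x))

print(compare(symmetric))     # mean equals median
print(compare(right_skewed))  # mean is larger than the median
print(compare(left_skewed))   # mean is smaller than the median
```

A few large values pull the mean toward the long tail, while the median stays near the bulk of the data.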
Scatter plots show how two variables behave together. They can reveal positive or negative trends between the variables, if any exist.
DATA: Name of the data frame (e.g., penguins)
VAR1: Name of the X variable to create a plot (e.g., flipper_len)
VAR2: Name of the Y variable to create a plot (e.g., body_mass)
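A scatter plot can be sketched with base R's `plot()` (an assumption; the lecture's plotting code is not shown here). The data frame `toy` and all of its values are hypothetical, invented only to illustrate a positive trend, and are not measurements from the penguins data.

```r
# Hypothetical illustrative pairs, NOT real penguin measurements
toy <- data.frame(
  flipper_len = c(181, 186, 190, 193, 195),
  body_mass   = c(3600, 3700, 3800, 3900, 4100)
)

# VAR1 on the x-axis, VAR2 on the y-axis
plot(toy$flipper_len, toy$body_mass,
     xlab = "Flipper length (mm)", ylab = "Body mass (g)", pch = 16)

# The sign of the correlation indicates the direction of the trend
r <- cor(toy$flipper_len, toy$body_mass)
print(r)
```

A positive correlation corresponds to points rising from left to right, and a negative one to points falling.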
m201.inqs.info/lectures/3