Linear Regression Models

Explaining variation, simple & multiple linear models, categorical predictors, correlation, and prediction

1 Setup

1.1 Google Colab

Copy the following code and put it in a code cell in Google Colab. Only do this if you are using a completely new notebook.

# This code will load the R packages we will use
install.packages(c("rcistats", "taylor",,
                 repos = c("https://inqs909.r-universe.dev", 
                           "https://cloud.r-project.org")))


library(rcistats)
library(tidyverse)
library(taylor)

1.2 Using the templates: what to change

  • DATA → your data frame/tibble (e.g., penguins)
  • Y → the outcome variable (e.g., body_mass_g)
  • X, X1, X2, …, Xp → predictor variables (e.g flipper_length_mm, species)
  • For categorical predictors: ensure the predictor is a factor (use factor(X) if needed).

2 Visualizing Variables

2.1 Scatter Plot (Continuous vs Continuous)

A scatter plot reveals association, trend direction, and form.

Template:

ggplot(DATA, aes(x = VAR1, y = VAR2)) +
  geom_point()

Example:

ggplot(penguins, aes(x = flipper_len, y = body_mass)) +
  geom_point()

A scatter plot that shows the relationship between flipper length and body mass. The data points show a positive relationship,  and possibly two groups.

2.2 Multi Densigy Plot (Continuous vs Categorical)

A density plot is a way to visualize the distribution of a continuous variable — it shows where data values are concentrated (dense) and where they are sparse.

Template:

ggplot(DATA, aes(x = NUM, color = CAT)) +
  geom_density()

Example:

ggplot(penguins, aes(x = flipper_len, color = species)) +
  geom_density()

Kernel density plot showing the distribution of penguin flipper lengths by species. Adelie penguins have shorter flippers centered around 190 mm, Chinstrap penguins are intermediate around 195 mm, and Gentoo penguins have much longer flippers centered around 215–220 mm,  with little overlap between Gentoo and the other species.

Template:

ggplot(DATA, aes(x = NUM, fill = CAT)) +
  geom_density()

Example:

ggplot(penguins, aes(x = flipper_len, fill = species)) +
  geom_density()

Filled kernel density plot showing the distribution of penguin  flipper lengths by species. Adelie penguins have shorter flippers centered around 190 mm, Chinstrap penguins are intermediate  around 195 mm, and Gentoo penguins have much longer  flippers centered around 215–220 mm,  with little overlap between Gentoo and the other species.

Template:

ggplot(DATA, aes(x = NUM, group = CAT)) +
  geom_density()

Example:

ggplot(penguins, aes(x = flipper_len, group = species)) +
  geom_density()

Kernel density plot showing the distribution of penguin flipper lengths by species. Adelie penguins have shorter flippers centered around 190 mm, Chinstrap penguins are intermediate around 195 mm, and Gentoo penguins have much longer flippers centered around 215–220 mm,  with little overlap between Gentoo and the other species.

2.3 Multi Box Plot (Continuous vs Categorical)

A box plot summarizes median, quartiles, and potential outliers.

Template:

ggplot(DATA, aes(x = NUM, y = CAT)) +
  geom_boxplot()

Example:

ggplot(penguins, aes(x = flipper_len, y = species)) +
  geom_boxplot()

Horizontal boxplots of penguin flipper lengths (mm) by species. Adelie penguins have the shortest flippers with a median around  190 mm and a few outliers; Chinstrap penguins are intermediate with  a median near 195 mm; and Gentoo penguins have the longest flippers  with a median around 215–220 mm. The distributions show clear  separation between Gentoo and the other species.

3 Modeling Relationships

3.1 Explaining variation

This is the process of trying to reduce unexplained variation in an outcome by using informative predictors — think getting it less wrong with an educated guess.

ggplot(penguins, aes(body_mass)) +
  geom_density()
Kernel density plot of penguin body mass (grams).  The distribution peaks around 3,600–3,800 grams and gradually  tapers toward higher body masses, with a long right tail  extending beyond 6,000 grams.
Figure 1

If we know some information, we can get less wrong over time.

# Same variable, grouped by a category (species)
ggplot(penguins, aes(body_mass, fill = species)) +
  geom_density(alpha = .5)

Filled kernel density plot of penguin body mass (grams) by species.  Adelie penguins (red) and Chinstrap penguins (green) have lower  body masses, with overlapping distributions centered around 3,500–4,000  grams, while Gentoo penguins (blue) are substantially heavier,  with a distribution centered around 4,800–5,500 grams and  little overlap with the other species.

3.2 A simple model (intercept-only)

We believe the outcome (\(Y\)) is generated by an unknown data generated process (\(DGP_1\)): \(Y \sim DGP_1\). A minimal model says all outcomes are generated from a number (labeled as \(\beta_0\)) plus or minus some error (\(\varepsilon\)):
\[ Y = \beta_0 + \varepsilon \]

This is known as a simple model and the errors are simulated by a different DGP \(\varepsilon \sim DGP_2\). The simple model is unknown, so we construct and estimated simple model for our best guess on how the data is generated:

\[ \hat Y = \hat\beta_0 \]

The carats above the letters are known as hats, which means best guess.

Tip

Observed vs. estimated:

  • Observed \(Y = \beta_0 + \varepsilon\)
  • Estimated \(\hat Y = \hat\beta_0\)

3.2.1 Fitting Simple Model

Template:

lm(Y ~ 1, data = DATA)
  • DATA → your data (e.g., penguins).
  • Y → the outcome variable (e.g., body_mass_g).

Example:

# Fit the null (intercept-only) model
m0 <- lm(body_mass ~ 1, data = penguins)
m0
#> 
#> Call:
#> lm(formula = body_mass ~ 1, data = penguins)
#> 
#> Coefficients:
#> (Intercept)  
#>        4207

3.3 Linear regression model with one predictor

Model form:

\[ Y = \beta_0 + \beta_1 X + \varepsilon, \quad \hat Y = \hat\beta_0 + \hat\beta_1 X. \]

# Scatter plot
ggplot(penguins, aes(flipper_len, body_mass)) +
  geom_point()

A scatter plot that shows the relationship between flipper length and body mass. The data points show a positive relationship,  and possibly two groups.

# Add a least-squares line
ggplot(penguins, aes(flipper_len, body_mass)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)

A scatter plot that shows the relationship between flipper length and body mass. The data points show a positive relationship,  and possibly two groups. There is a line going between the  points showing a positive line.

# Fit the model
m1 <- lm(body_mass ~ flipper_len, data = penguins)
m1
#> 
#> Call:
#> lm(formula = body_mass ~ flipper_len, data = penguins)
#> 
#> Coefficients:
#> (Intercept)  flipper_len  
#>    -5872.09        50.15

Template:

lm(Y ~ X, data = DATA)

4 Categorical predictors (dummy variables)

When we use a categorical predictor in a regression, R needs to convert the categories into numbers. This is done using dummy variables (also called indicator variables).

4.1 General idea

  • Suppose we have a categorical variable with C categories.
  • We create C − 1 dummy variables, each taking value:
    • 1 if the observation belongs to that category
    • 0 otherwise
  • The category without a dummy variable becomes the reference (baseline) group.

Example: Penguin species:

The species variable has 3 categories: Adelie, Chinstrap, Gentoo.

We create two dummy variables:

  • \(D_1\): 1 if Chinstrap, 0 otherwise
  • \(D_2\): 1 if Gentoo, 0 otherwise
  • Adelie is the reference (when both \(D_1 = 0, D_2 = 0\))
Species \(D_1\) (Chinstrap) \(D_2\) (Gentoo)
Adelie 0 0
Chinstrap 1 0
Gentoo 0 1

4.2 Regression model with dummy variables

If we model penguin body mass (\(Y\)) with species:

\[ \hat Y_i = \beta_0 + \beta_1 D_{1i} + \beta_2 D_{2i} \]

  • \(\beta_0\): mean of Adelie (reference group)
  • \(\beta_1\): difference in mean body mass between Chinstrap and Adelie
  • \(\beta_2\): difference in mean body mass between Gentoo and Adelie

Predictions:
- Adelie: \(\hat Y = \beta_0\)
- Chinstrap: \(\hat Y = \beta_0 + \beta_1\)
- Gentoo: \(\hat Y = \beta_0 + \beta_2\)

4.3 R Implementation

R automatically creates dummy variables when you use a factor in lm().
The first level of the factor is used as the reference group (by default).

# Fit model with species (factor) as predictor
m <- lm(body_mass ~ species, data = penguins)
summary(m)
#> 
#> Call:
#> lm(formula = body_mass ~ species, data = penguins)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1142.44  -342.44   -33.09   307.56  1207.56 
#> 
#> Coefficients:
#>                  Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)       3706.16      38.14  97.184   <2e-16 ***
#> speciesChinstrap    26.92      67.65   0.398    0.691    
#> speciesGentoo     1386.27      56.91  24.359   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 460.8 on 330 degrees of freedom
#> Multiple R-squared:  0.6745, Adjusted R-squared:  0.6725 
#> F-statistic: 341.9 on 2 and 330 DF,  p-value: < 2.2e-16

Templates:

lm(Y ~ X, data = DATA)           # if X is already a factor
lm(Y ~ factor(X), data = DATA)   # force X to be treated as factor

4.4 Group Statistics

The function num_by_cat_stats() quickly computes descriptive statistics of a numerical variable grouped by a categorical variable.

Template:

num_by_cat_stats(DATA, NUM, CAT)
  • DATA: the data frame (e.g., penguins)
  • NUM: the numerical variable you want to summarize (e.g., body_mass_g)
  • CAT: the categorical variable that defines groups (e.g., species)

Example:

num_by_cat_stats(penguins, body_mass, species)
#>   Categories  min    q25     mean median  q75  max      sd      var   iqr
#> 1     Adelie 2850 3362.5 3706.164   3700 4000 4775 458.620 210332.4 637.5
#> 2  Chinstrap 2700 3487.5 3733.088   3700 3950 4800 384.335 147713.5 462.5
#> 3     Gentoo 3950 4700.0 5092.437   5050 5500 6300 501.476 251478.3 800.0
#>   missing
#> 1       0
#> 2       0
#> 3       0

5 Strength & correlation

Correlation (two numerical variables): \(-1 \le r \le 1\). For simple linear regression, \(R^2 = r^2\).

cor(penguins$body_mass, penguins$flipper_len)
#> [1] 0.8729789
# R^2 helper
r2(lm(body_mass ~ flipper_leng, data = penguins))
#> Error:
#> ! object 'flipper_leng' not found

Template:

# Correlation
cor(DATA$Y, DATA$X)

## R-Squared
xlm <- lm(Y ~ X, data = DATA); 
r2(xlm)

6 Prediction

Model: \(\hat Y = \hat\beta_0 + \hat\beta_1 X\). Supply new X to get a prediction \(\hat Y\).

Template:

m <- lm(Y ~ X, data = DATA)
ndf <- data.frame(X = VAL)
predict(m, newdata = ndf)

Examples:

xlm1 <- lm(body_mass ~ species, data = penguins)
predict(xlm1, newdata = data.frame(species = "Gentoo"))
#>        1 
#> 5092.437
xlm2 <- lm(body_mass ~ flipper_len, data = penguins)
predict(xlm2, newdata = data.frame(flipper_len = 190))
#>        1 
#> 3657.028

7 Multiple linear regression (MLR)

Model: \(Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon\)

Template:

lm(Y ~ X1 + X2 + ... + Xp, data = DATA)

Example: danceability ~ mode_name + valence + energy

# Example: danceability ~ mode_name + valence + energy
lm(danceability ~ mode_name + valence + energy, data = taylor_album_songs)
#> 
#> Call:
#> lm(formula = danceability ~ mode_name + valence + energy, data = taylor_album_songs)
#> 
#> Coefficients:
#>    (Intercept)  mode_nameminor         valence          energy  
#>        0.53816         0.08386         0.17554        -0.05944

7.1 Adjusted \(R^2\)

\[R^2 = 1 - \frac{\text{Var(resid)}}{\text{Var}(Y)}, \quad R^2_{adj} = 1 - \Big(\frac{\text{Var(resid)}}{\text{Var}(Y)}\Big) \cdot \frac{n-1}{n-k-1}\]

m <- lm(danceability ~ mode_name + valence + energy, data = taylor_album_songs)
ar2(m)
#> [1] 0.09148268

8 Appendix: quick templates (copy‑paste)

8.1 Simple Model

lm(Y ~ 1, data = DATA)
  • DATA → your data frame (e.g., penguins)
  • Y → the outcome variable (e.g., body_mass)

8.2 Simple linear regression

lm(Y ~ X, data = DATA)
  • DATA → your data frame (e.g., penguins)
  • Y → the outcome variable (e.g., body_mass)
  • X → predictor variables (e.g flipper_length_mm)

8.3 SLR: Categorical predictor

lm(Y ~ X, data = DATA) 
  • DATA → your data frame (e.g., penguins)
  • Y → the outcome variable (e.g., body_mass)
  • X → predictor variables (e.g species)

8.4 SLR: Categorical predictor (Create Factor)

lm(Y ~ factor(X), data = DATA)   # coerce X to factor
  • DATA → your data frame (e.g., penguins)
  • Y → the outcome variable (e.g., body_mass)
  • X → predictor variables (e.g species)

8.5 Numerical Statistics by Categories

num_by_cat_stats(DATA, NUM, CAT)
  • DATA: the data frame (e.g., penguins)
  • NUM: the numerical variable you want to summarize (e.g., body_mass)
  • CAT: the categorical variable that defines groups (e.g., species)

8.6 Correlation

cor(DATA$Y, DATA$X, use = "complete.obs")
  • DATA → your data frame (e.g., penguins)
  • Y → the outcome variable (e.g., body_mass)
  • X → predictor variables (e.g flipper_len)

8.7 \(R^2\)

m <- lm(Y ~ X, data = DATA)
r2(m)
  • DATA → your data frame (e.g., penguins)
  • Y → the outcome variable (e.g., body_mass)
  • X → predictor variables (e.g flipper_len)

8.8 Multiple linear regression

lm(Y ~ X1 + X2 + ... + Xp, data = DATA)
  • DATA → your data frame (e.g., penguins)
  • Y → the outcome variable (e.g., body_mass)
  • X1, X2, …, Xp → predictor variables (e.g flipper_len, species)

8.9 Adjusted \(R^2\)

# Adjusted R^2
m <- lm(Y ~ X1 + X2 + ... + Xp, data = DATA)
ar2(m)
  • DATA → your data frame (e.g., penguins)
  • Y → the outcome variable (e.g., body_mass)
  • X1, X2, …, Xp → predictor variables (e.g flipper_len, species)

8.10 Prediction: SLR

m <- lm(Y ~ X, data = DATA)
ndf <- data.frame(X = VAL)
predict(m, newdata = ndf)
  • DATA → your data frame (e.g., penguins)
  • Y → the outcome variable (e.g., body_mass)
  • X → predictor variables (e.g flipper_len)

8.11 Prediction: MLR

m <- lm(Y ~ X1 + X2 + ... + Xp, data = DATA)
ndf <- data.frame(X1 = VAL1, X2 = VAL2, ..., Xp = VALp)
predict(m, newdata = ndf)
  • DATA → your data frame (e.g., penguins)
  • Y → the outcome variable (e.g., body_mass)
  • X1, X2, …, Xp → predictor variables (e.g flipper_len, species)
  • VAL1, VAL2, …, VALp → predictor values (e.g 150, "Gentoo")

8.12 Scatter Plot

ggplot(DATA, aes(x = VAR1, y = VAR2)) + 
    geom_point()
  • DATA → the name of your dataset (e.g., penguins).
  • VAR1 → a single numerical variable (e.g., body_mass).
  • VAR2 → a single numerical variable (e.g., flipper_len).

8.13 Scatter Plot + trend line

ggplot(DATA, aes(x = VAR1, y = VAR2)) + 
    geom_point() + 
    geom_smooth(method = "lm", se = TRUE)
  • DATA → the name of your dataset (e.g., penguins).
  • VAR1 → a single numerical variable (e.g., body_mass).
  • VAR2 → a single numerical variable (e.g., flipper_len).

8.14 Multiple Density Plot (Continous vs Categorical)

ggplot(DATA, aes(x = NUM, color = CAT)) + 
    geom_density() 
  • DATA → the name of your dataset (e.g., penguins).
  • NUM → a single numerical variable (e.g., body_mass).
  • CAT → a single categorical variable (e.g., species).

8.15 Multiple Box Plots (Continous vs Categorical)

ggplot(DATA, aes(x = NUM, y = CAT)) + 
    geom_boxplot() 
  • DATA → the name of your dataset (e.g., penguins).
  • NUM → a single numerical variable (e.g., body_mass).
  • CAT → a single categorical variable (e.g., species).