Inference with Linear Regression

2025-04-21

Motivating Example

  • Motivating Example

  • Mathematical Models

  • \(\beta\) Hypothesis Testing

  • Model Hypothesis Testing

  • Model Assumptions

Motivating Example

Code
# assumes ggplot2, patchwork, and the penguins data are loaded
p1 <- penguins |> ggplot(aes(x = species, y = body_mass_g)) +
  geom_jitter() + 
  geom_boxplot() + 
  labs(x = "Species", y = "Body Mass")
  
p2 <- penguins |> ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() + 
  labs(x = "Flipper Length", y = "Body Mass")

p1 + p2  # patchwork combines the plots side by side

Mathematical Models

  • Motivating Example

  • Mathematical Models

  • \(\beta\) Hypothesis Testing

  • Model Hypothesis Testing

  • Model Assumptions

Mathematical Models

Standard Normal Distribution

\[ {\frac{1}{\sqrt{2 \pi}}} e^{-\frac{1}{2}x^2} \]

Code
# standard normal density over [-5, 5]
x <- seq(-5, 5, length.out = 100)
data.frame(x = x, y = dnorm(x)) |> 
  ggplot() +
  geom_line(aes(x, y)) +
  ylab("y")

t Distribution

\[ \frac{\Gamma \left(\frac{v+1}{2}\right)}{\sqrt{\pi v}\Gamma\left(\frac{v}{2}\right)} \left(1 + \frac{x^2}{v}\right)^{-\frac{v+1}{2}} \]

  • \(v\): degrees of freedom

Code
# t densities for df = 1, 10, 30, 100 approach the standard normal
x <- seq(-5, 5, length.out = 100)
data.frame(x = x,
           y1 = dt(x, 1),
           y2 = dt(x, 10),
           y3 = dt(x, 30),
           y4 = dt(x, 100),
           y5 = dnorm(x)) |> 
  ggplot() +
  geom_line(aes(x, y1, color = "1")) +
  geom_line(aes(x, y2, color = "10")) +
  geom_line(aes(x, y3, color = "30")) +
  geom_line(aes(x, y4, color = "100")) +
  geom_line(aes(x, y5, color = "Normal")) +
  ylab("y")

\(\beta\) Hypothesis Testing

  • Motivating Example

  • Mathematical Models

  • \(\beta\) Hypothesis Testing

  • Model Hypothesis Testing

  • Model Assumptions

Hypothesis

\[H_0: \beta = \theta\]

\[H_1: \beta \ne \theta\]

Testing \(\beta_j\)

\[ T = \frac{\hat\beta_j-\theta}{\mathrm{se}(\hat\beta_j)} \sim t_{n-p^\prime} \]

  • \(n\): sample size
  • \(p^\prime\): number of \(\beta\)s (including the intercept)

P-Value

Alternative Hypothesis    p-value
\(\beta>\theta\)          \(P(t_{n-p^\prime} > T)=p\)
\(\beta<\theta\)          \(P(t_{n-p^\prime} < T)=p\)
\(\beta\ne\theta\)        \(2\times P(t_{n-p^\prime} > |T|)=p\)
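
In R, \(T\) and its p-value can be computed by hand from a fitted model with pt(). A minimal sketch, assuming \(\theta = 0\) and a model chosen for illustration:

Code
fit <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
est <- coef(summary(fit))["flipper_length_mm", ]
T_stat <- (est["Estimate"] - 0) / est["Std. Error"]  # theta = 0
df <- fit$df.residual                                # n - p'
2 * pt(abs(T_stat), df, lower.tail = FALSE)          # two-sided p-value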

Confidence Intervals

\[ \hat \beta_j \pm CV \times \mathrm{se}(\hat\beta_j) \]

  • \(CV\): critical value, \(P(t_{n-p^\prime} < CV) = 1-\alpha/2\)

  • \(\alpha\): significance level

  • \(\mathrm{se}\): standard error function
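
The interval can be reproduced by hand with qt(), matching confint() below. A minimal sketch, with the model and level chosen for illustration:

Code
fit <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
alpha <- 0.10                                # 90% confidence
cv <- qt(1 - alpha/2, df = fit$df.residual)  # critical value
est <- coef(summary(fit))["flipper_length_mm", ]
est["Estimate"] + c(-1, 1) * cv * est["Std. Error"]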

Conducting HT of \(\beta_j\)

Code
# fit the model; summary() reports a t test for each coefficient
xlm <- lm(Y ~ X, data = DATA)
summary(xlm)

Example

Code
m1 <- lm(body_mass_g ~ species + flipper_length_mm, penguins)
summary(m1)
#> 
#> Call:
#> lm(formula = body_mass_g ~ species + flipper_length_mm, data = penguins)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -898.8 -252.0  -24.8  229.8 1191.6 
#> 
#> Coefficients:
#>                   Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)       -4013.18     586.25  -6.846 3.74e-11 ***
#> speciesChinstrap   -205.38      57.57  -3.568 0.000414 ***
#> speciesGentoo       284.52      95.43   2.981 0.003083 ** 
#> flipper_length_mm    40.61       3.08  13.186  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 373.3 on 329 degrees of freedom
#> Multiple R-squared:  0.787,  Adjusted R-squared:  0.7851 
#> F-statistic: 405.3 on 3 and 329 DF,  p-value: < 2.2e-16

Confidence Interval

Code
confint(xlm, level = LEVEL)

Example

Code
confint(m1, level = 0.90)
#>                           5 %        95 %
#> (Intercept)       -4980.19064 -3046.16714
#> speciesChinstrap   -300.33123  -110.41973
#> speciesGentoo       127.11143   441.93578
#> flipper_length_mm    35.52645    45.68588

Model Hypothesis Testing

  • Motivating Example

  • Mathematical Models

  • \(\beta\) Hypothesis Testing

  • Model Hypothesis Testing

  • Model Assumptions

Model Inference

We conduct model inference to determine whether one model explains the variation in the outcome better than another. A common example compares a linear model (\(\hat Y=\hat\beta_0 + \hat\beta_1 X\)) to the mean of \(Y\) (\(\hat \mu_y\)). We assess the significance of the variation explained using an Analysis of Variance (ANOVA) table and an F test.

Model Inference

Given 2 models:

\[ \hat Y = \hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 X_2 + \cdots + \hat\beta_p X_p \]

or

\[ \hat Y = \bar y \]

Does the model with predictors do a better job than simply using the average?

ANOVA

ANOVA Table

Source  DF            SS               MS                        F
Model   \(DFR=k-1\)   \(SSR\)          \(MSR=\frac{SSR}{DFR}\)   \(\hat F=\frac{MSR}{MSE}\)
Error   \(DFE=n-k\)   \(SSE\)          \(MSE=\frac{SSE}{DFE}\)
Total   \(TDF=n-1\)   \(TSS=SSR+SSE\)

  • \(k\): number of estimated \(\beta\)s (equal to \(p^\prime\))

\[ \hat F \sim F(DFR, DFE) \]
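
The overall F statistic reported by summary() can be checked against pf(). A minimal sketch with a model chosen for illustration:

Code
fit <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
fstat <- summary(fit)$fstatistic  # value, numdf, dendf
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)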

Conducting an ANOVA in R

Code
xlm <- lm(Y ~ X, data = DATA)
summary(xlm)  # the last line reports the overall F test

Example

Code
summary(m1)
#> 
#> Call:
#> lm(formula = body_mass_g ~ species + flipper_length_mm, data = penguins)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -898.8 -252.0  -24.8  229.8 1191.6 
#> 
#> Coefficients:
#>                   Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)       -4013.18     586.25  -6.846 3.74e-11 ***
#> speciesChinstrap   -205.38      57.57  -3.568 0.000414 ***
#> speciesGentoo       284.52      95.43   2.981 0.003083 ** 
#> flipper_length_mm    40.61       3.08  13.186  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 373.3 on 329 degrees of freedom
#> Multiple R-squared:  0.787,  Adjusted R-squared:  0.7851 
#> F-statistic: 405.3 on 3 and 329 DF,  p-value: < 2.2e-16

Model Inference

Model inference can be extended to compare models that have different numbers of predictors.

Model Inference

Given:

\[ M1:\ \hat y = \hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 X_2 \]

\[ M2:\ \hat y = \hat\beta_0 + \hat\beta_1 X_1 \]

Let \(M1\) be the FULL (larger) model, and let \(M2\) be the RED (Reduced, smaller) model.

Model Inference

We can test the following hypotheses:

  • \(H_0\): The error variations between the FULL and RED model are not different.
  • \(H_1\): The error variations between the FULL and RED model are different.

Test Statistic

\[ \hat F = \frac{[SSE(RED) - SSE(FULL)]/[DFE(RED)-DFE(FULL)]}{MSE(FULL)} \]

\[ \hat F \sim F[DFE(RED) - DFE(FULL), DFE(FULL)] \]
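
The statistic can be computed by hand from two nested fits and checked against anova(red, full). A minimal sketch with models chosen for illustration:

Code
full <- lm(body_mass_g ~ species + flipper_length_mm, data = penguins)
red  <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
sse_full <- sum(resid(full)^2); dfe_full <- full$df.residual
sse_red  <- sum(resid(red)^2);  dfe_red  <- red$df.residual
Fhat <- ((sse_red - sse_full) / (dfe_red - dfe_full)) / (sse_full / dfe_full)
pf(Fhat, dfe_red - dfe_full, dfe_full, lower.tail = FALSE)  # p-value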

ANOVA in R

Code
full <- lm(Y ~ X1 + X2 + X3 + X4, data = DATA)
red  <- lm(Y ~ X1 + X2, data = DATA)
anova(red, full)  # reduced model listed first

Example

Code
m1 <- lm(body_mass_g ~ species + island + flipper_length_mm, penguins)
m2 <- lm(body_mass_g ~ island + flipper_length_mm, penguins)
anova(m2, m1)
#> Analysis of Variance Table
#> 
#> Model 1: body_mass_g ~ island + flipper_length_mm
#> Model 2: body_mass_g ~ species + island + flipper_length_mm
#>   Res.Df      RSS Df Sum of Sq      F    Pr(>F)    
#> 1    329 47774435                                  
#> 2    327 45552857  2   2221579 7.9738 0.0004157 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model Assumptions

  • Motivating Example

  • Mathematical Models

  • \(\beta\) Hypothesis Testing

  • Model Hypothesis Testing

  • Model Assumptions

Model Assumptions

When conducting inference with linear regression, we must check the following conditions:

  • Linearity
  • Independence
  • Normality
  • Equal Variances
  • No multicollinearity (for MLR)

Linearity

Often considered the most important assumption: there must be a linear relationship between the outcome variable (\(y\)) and the predictors (\(x_1\), \(x_2\), …).

Independence

The data points must not influence each other.

Normality

The model errors (estimated by the residuals) must follow a normal distribution.

Equal Variances

The variability of the data points must be the same for all predictor values.

Residuals

Residuals are the differences between the observed values and the values fitted by the model. Common residuals include (see the sketch after this list):

  • Raw Residual

  • Standardized Residual

  • Jackknife (studentized) Residuals
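
All three are available in base R. A minimal sketch with a model chosen for illustration:

Code
fit <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
head(resid(fit))      # raw residuals
head(rstandard(fit))  # standardized residuals
head(rstudent(fit))   # jackknife (studentized) residuals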

Influential Measurements

Influential measures are statistics that quantify how much a data point affects the fitted model. Common influential measures are (see the sketch after this list):

  • Leverages

  • Cook’s Distance
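
Both are available in base R. A minimal sketch with a model chosen for illustration:

Code
fit <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
head(hatvalues(fit))       # leverages
head(cooks.distance(fit))  # Cook's distance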

Raw Residuals

\[ \hat r_i = y_i - \hat y_i \]

Residual Analysis

A residual analysis is used to test the assumptions of linear regression.

QQ Plot

A QQ (quantile-quantile) plot graphs the sample quantiles of the residuals against the theoretical quantiles of a normal distribution. If the points lie close to the \(y=x\) line, the residuals are consistent with a normal distribution.

Residual vs Fitted Plot

This plot lets you assess linearity and constant variance and identify potential outliers. Create a scatter plot with the fitted values on the x-axis and the raw/standardized residuals on the y-axis.

Variance Inflation Factor

The variance inflation factor (VIF) measures how collinear each predictor is with the other predictors. A value greater than 10 is cause for concern and action should be taken.
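
VIFs are commonly computed with vif() from the car package (the package choice is an assumption; the source does not name one). A minimal sketch:

Code
library(car)  # assumed available
# factor predictors are reported as generalized VIFs (GVIF)
vif(lm(body_mass_g ~ species + flipper_length_mm, data = penguins))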

Residual Analysis in R

Use the resid_df function to obtain a data frame of fitted values and residuals from a model.

Code
rdf <- resid_df(LM_OBJECT)
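
resid_df appears to be a course-provided helper. If it is unavailable, a minimal stand-in consistent with the plots below (an assumption, not the author's definition) is:

Code
resid_df <- function(xlm) {
  # data frame of fitted values and raw residuals, base R only
  data.frame(fitted = fitted(xlm), resid = resid(xlm))
}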

Residual vs Fitted Plot

Code
ggplot(RDF, aes(fitted, resid)) +
  geom_point() +
  geom_hline(yintercept = 0, col = "red")

QQ Plot

Code
ggplot(RDF, aes(sample = resid)) + 
  stat_qq() +
  stat_qq_line() 

Example

Code
xlm <- lm(body_mass_g ~ island + species + flipper_length_mm,
          penguins)
dfxlm <- resid_df(xlm)

ggplot(dfxlm, aes(fitted, resid)) +
  geom_point() +
  geom_hline(yintercept = 0, col = "red")

ggplot(dfxlm, aes(sample = resid)) + 
  stat_qq() +
  stat_qq_line()