Simple Linear Regression

2025-02-25

Modeling Relationships

Modeling Relationships
A Simple Model
Modeling Data
Linear Model
Categorical Variables
Strength and Correlation
Prediction

Explaining Variation

Explaining variation is the process of reducing the unexplained variation in an outcome by using other variables.

It can be thought of as being less wrong when making an educated guess.

Explaining Variation

library(ggplot2)
# Assumes `penguins` is palmerpenguins::penguins with incomplete rows removed
ggplot(penguins, aes(body_mass_g)) +
  geom_density()

Variation with One Variable

ggplot(penguins, aes(body_mass_g, fill = species)) +
  geom_density(alpha = .5)

A Simple Model

Generated Model

\[ Y \sim DGP_1 \]

A Simple Model

ggplot(penguins, aes(body_mass_g)) +
  geom_density()

A Simple Model

\[ Y = \_\_\_ + error \]

Notation

\[ Y = \ \ \ \ \ \ \ \ \ + \varepsilon \]

The Simple Generated Model

\[ Y \sim \beta_0 + \varepsilon \]

\[ \varepsilon \sim DGP_2 \]

\(DGP_2\) is not the same as \(DGP_1\); it is transformed due to \(\beta_0\). Consider this the null \(DGP\).
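A minimal simulation sketch of this null model. The normal error distribution, the baseline value, and the error standard deviation are all illustrative assumptions; the slides do not specify the form of \(DGP_2\).

```r
set.seed(1)                                # for reproducibility
beta0 <- 4200                              # hypothetical baseline value
eps   <- rnorm(1000, mean = 0, sd = 800)   # errors: assumed normal (illustrative DGP)
y     <- beta0 + eps                       # Y = beta_0 + error

mean(y)    # close to beta0
mean(eps)  # close to 0
```

The outcome has the same spread as the errors; only its center is moved by the baseline value.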

Observing Data

\[ Y = \beta_0 + \varepsilon \]

Estimated Line

\[ \hat Y=\hat\beta_0 \]

Notation

Observed

\[ Y = \beta_0 + \varepsilon \]

Estimated

\[ \hat Y = \hat \beta_0 \]

Modeling Data

Indexing Data

Each observation in a data set can be indexed by its row number.

penguins[1,-c(1:2)]

#> # A tibble: 1 × 6
#>   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex    year
#>            <dbl>         <dbl>             <int>       <int> <fct> <int>
#> 1           39.1          18.7               181        3750 male   2007

Let \(Y\) represent “body_mass_g” and \(X\) represent “flipper_length_mm”:

\[ Y_1 = 3750, \quad X_1 = 181 \]

Indexing Data

\[ Y_i, X_i \]

Data

With the data that we collect from a sample, we hypothesize how the data was generated.

Using a simple model:

\[ Y_i = \beta_0 + \varepsilon_i \]

Estimated Value

\[ \hat Y_i = \hat \beta_0 \]

Estimation

To obtain \(\hat \beta_0\), we minimize the following function:

\[ \sum^n_{i=1} (Y_i-\hat Y_i)^2 \]

This is known as the sum of squared errors (SSE).
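For the intercept-only model, the value that minimizes the SSE is the sample mean. The sketch below checks this numerically on a small set of made-up outcome values (not the penguins data); the helper name `sse` is hypothetical.

```r
# Toy outcome values (hypothetical, for illustration only)
y <- c(3750, 3800, 3250, 3450, 3650)

# SSE as a function of a candidate constant b0
sse <- function(b0) sum((y - b0)^2)

# Numerically search for the constant that minimizes the SSE
best <- optimize(sse, interval = range(y))$minimum

# The intercept-only model and the sample mean give the same answer
fit <- lm(y ~ 1)
c(optimize = best, lm = unname(coef(fit)), mean = mean(y))
```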

Residuals

The residuals are the observed errors, the differences between the data and the values fitted by the model:

\[ r_i = Y_i - \hat Y_i \]
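A short sketch, on made-up outcome values, showing that the residuals reported by R match the definition above and that their squares add up to the SSE.

```r
# Toy outcome values (hypothetical, for illustration only)
y <- c(3750, 3800, 3250, 3450, 3650)

fit <- lm(y ~ 1)

# Residuals by definition: observed minus fitted
r_manual <- y - fitted(fit)

# Compare with the residuals stored in the fitted model
all.equal(unname(residuals(fit)), unname(r_manual))

# Sum of squared residuals is the SSE of the fitted model
sum(r_manual^2)
```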

Estimation in R

lm(Y ~ 1, data = DATA)

Y: Name of the outcome variable of interest in the data frame DATA
DATA: Name of the data frame

Modeling Body Mass in Penguins

lm(body_mass_g ~ 1, data = penguins)

#> 
#> Call:
#> lm(formula = body_mass_g ~ 1, data = penguins)
#> 
#> Coefficients:
#> (Intercept)  
#>        4207

\[ \hat Y = 4207 \]

Visualize

ggplot(penguins, aes(body_mass_g)) +
  geom_density() +
  geom_vline(xintercept = 4207)

Linear Model

Linear Model

A goal of statistics is to develop models that better explain the outcome \(Y\).

In particular, we want to reduce the sum of squared errors.

By using a bit more information, \(X\), we can improve the predictive capability of the model.

Thus, the linear model is born.

ggplot(penguins, aes(body_mass_g)) +
  geom_density()

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, fill = after_stat(level))) +
  stat_density_2d(geom = "polygon")

Linear Model

\[ Y = \beta_0 + \beta_1 X + \varepsilon \]

\[ \varepsilon \sim DGP_3 \]

Scatter Plot

ggplot(penguins, aes(flipper_length_mm, body_mass_g)) + 
  geom_point()

Imposing a Line

ggplot(penguins, aes(flipper_length_mm, body_mass_g)) + 
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)

Modeling the Data

\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \]

Linear Model

\[ \hat Y_i = \hat \beta_0 + \hat \beta_1 X_i \]

The goal is to obtain numerical values for \(\hat \beta_0\) and \(\hat \beta_1\) that minimize the SSE.

SSE

\[ \sum^n_{i=1} (Y_i-\hat Y_i)^2 \]

\[ \hat Y_i = \hat \beta_0 + \hat \beta_1 X_i \]
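Minimizing this SSE has a well-known closed-form solution: the slope is the sum of cross-deviations divided by the sum of squared deviations of \(X\), and the intercept is \(\bar Y - \hat\beta_1 \bar X\). These formulas are standard least-squares results rather than something stated in the slides. The sketch below checks them against lm() on made-up data.

```r
# Toy data (hypothetical): flipper length in mm and body mass in g
x <- c(181, 186, 195, 193, 190, 210, 215)
y <- c(3750, 3800, 3250, 3450, 3650, 4500, 5000)

# Closed-form least-squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Same estimates from lm()
fit <- lm(y ~ x)
rbind(manual = c(b0, b1), lm = unname(coef(fit)))
```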

Fitting a Model in R

lm(Y ~ X, data = DATA)

X: Name of the predictor variable of interest in the data frame DATA
Y: Name of the outcome variable of interest in the data frame DATA
DATA: Name of the data frame

Example

Y: “body_mass_g”; X: “flipper_length_mm”

lm(body_mass_g ~ flipper_length_mm, data = penguins)

#> 
#> Call:
#> lm(formula = body_mass_g ~ flipper_length_mm, data = penguins)
#> 
#> Coefficients:
#>       (Intercept)  flipper_length_mm  
#>          -5872.09              50.15

\[ \hat Y_i = -5872.09 + 50.15 X_i \]

Interpretation of \(\hat\beta_0\)

The intercept \(\hat \beta_0\) can be interpreted as the predicted value of \(Y\) when \(X\) is 0.

Sometimes the intercept has a meaningful real-world interpretation.

Other times it does not.

Interpreting Example

\[ \hat Y_i = -5872.09 + 50.15 X_i \]

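When flipper length is 0 mm, the predicted body mass is -5872 grams. A negative mass is impossible and no penguin has a 0 mm flipper, so this intercept has no real-world interpretation.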

Interpretation of \(\hat \beta_1\)

The slope \(\hat \beta_1\) indicates how \(Y\) changes, on average, when \(X\) increases by 1 unit.

Its sign shows whether the relationship is, on average, positive or negative.

Interpreting Example

\[ \hat Y_i = -5872.09 + 50.15 X_i \]

When flipper length increases by 1 mm, the body mass increases by 50.15 grams, on average.
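As a worked check of this interpretation, plug two flipper lengths that differ by 1 mm into the fitted equation. The values 200 mm and 201 mm are chosen only for illustration:

\[ \hat Y(200) = -5872.09 + 50.15 (200) = 4157.91 \]

\[ \hat Y(201) = -5872.09 + 50.15 (201) = 4208.06 \]

\[ \hat Y(201) - \hat Y(200) = 4208.06 - 4157.91 = 50.15 \]

The difference equals the slope, whichever starting flipper length is used.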

Categorical Variables

Body Mass with Species

ggplot(penguins, aes(body_mass_g)) +
  geom_boxplot()

Body Mass with Species

ggplot(penguins, aes(body_mass_g, fill = species)) +
  geom_boxplot()

Group Statistics

We can use summary statistics to describe a continuous variable within each category.

Compute statistics for each group.

num_by_cat_stats(DATA, NUM, CAT)

NUM: Name of the numerical variable
CAT: Name of the categorical variable
DATA: Name of the data frame

Compute Group Statistics

num_by_cat_stats(penguins, body_mass_g, species)

#>   Categories  min    q25     mean median  q75  max      sd      var   iqr
#> 1     Adelie 2850 3362.5 3706.164   3700 4000 4775 458.620 210332.4 637.5
#> 2  Chinstrap 2700 3487.5 3733.088   3700 3950 4800 384.335 147713.5 462.5
#> 3     Gentoo 3950 4700.0 5092.437   5050 5500 6300 501.476 251478.3 800.0
#>   missing
#> 1       0
#> 2       0
#> 3       0
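The num_by_cat_stats() function is not part of base R. As a rough base-R sketch of the same idea, the code below computes a few of the statistics per group on made-up data; the data frame `toy` and its column names are hypothetical.

```r
# Toy data (hypothetical values)
toy <- data.frame(
  mass  = c(3700, 3800, 3600, 3750, 3650, 5000, 5200, 5100),
  group = factor(c("A", "A", "A", "B", "B", "C", "C", "C"))
)

# Split the numerical variable by category, then summarize each piece
by_group <- split(toy$mass, toy$group)
stats <- t(sapply(by_group, function(v) {
  c(min = min(v), mean = mean(v), median = median(v), max = max(v), sd = sd(v))
}))
stats
```

Each row of the result is one category and each column is one statistic.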

LM with Categorical Variables

A line is normally used to model 2 continuous variables.

However, the predictor variable \(X\) can be restricted to a set of values that represent categories.

One category is used as the reference for the model.

Binary (Dummy) Variables

Binary variables are variables that can only take the value 0 or 1.

\[ D_i = \left\{ \begin{array}{ll} 1 & \text{Category}\\ 0 & \text{Other} \end{array} \right. \]

Binary (Dummy) Variables

To fit a model with categorical variables, we must utilize dummy (binary) variables that indicate which category is being referenced. We use \(C-1\) dummy variables where \(C\) indicates the number of categories. When coded correctly, each category will be represented by a combination of dummy variables.

Example

If we have 4 categories, we will need 3 dummy variables. Cat 4, which is 0 on every dummy variable, is the reference category:

	Cat 1	Cat 2	Cat 3	Cat 4
Dummy 1	1	0	0	0
Dummy 2	0	1	0	0
Dummy 3	0	0	1	0

Species Dummy Variables

	Chinstrap	Gentoo	Adelie
\(D_1\)	1	0	0
\(D_2\)	0	1	0

Linear Model

\[ \hat Y_i = \hat \beta_0 + \hat\beta_1 D_{1i} + \hat\beta_2 D_{2i} \]

\(\hat \beta_1\) indicates how average body mass changes from Adelie to Chinstrap.

\(\hat \beta_2\) indicates how average body mass changes from Adelie to Gentoo.

\(\hat \beta_0\) represents the baseline level, in this case the average body mass of Adelie penguins.

Fitting a Model in R

lm(Y ~ X, data = DATA)

X: Name of the predictor variable of interest in the data frame DATA; must be a factor variable
Y: Name of the outcome variable of interest in the data frame DATA
DATA: Name of the data frame

X not a Factor

lm(Y ~ factor(X), data = DATA)

X: Name of the predictor variable of interest in the data frame DATA; not a factor variable
Y: Name of the outcome variable of interest in the data frame DATA
DATA: Name of the data frame

Example

lm(body_mass_g ~ species, penguins)

#> 
#> Call:
#> lm(formula = body_mass_g ~ species, data = penguins)
#> 
#> Coefficients:
#>      (Intercept)  speciesChinstrap     speciesGentoo  
#>          3706.16             26.92           1386.27

\[ \hat Y_i = 3706.16 + 26.92 D_{1i} + 1386.27 D_{2i} \]

Finding the Adelie Mass

\[ \hat Y_i = 3706.16 + 26.92 (0) + 1386.27 (0) = 3706.16 \]

Finding the Chinstrap Mass

\[ \hat Y_i = 3706.16 + 26.92 (1) + 1386.27 (0) = 3733.08 \]

Finding the Gentoo Mass

\[ \hat Y_i = 3706.16 + 26.92 (0) + 1386.27 (1) = 5092.43 \]

Interpreting \(\hat \beta_1\)

On average, Chinstrap penguins have a body mass about 26.92 grams larger than Adelie penguins.

Interpreting \(\hat \beta_2\)

On average, Gentoo penguins have a body mass about 1386.27 grams larger than Adelie penguins.
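With a single categorical predictor, the fitted coefficients are group means in disguise: the intercept is the reference group's mean and each other coefficient is a difference in means from the reference. The sketch below checks this on made-up data; the data frame `toy` is hypothetical.

```r
# Toy data (hypothetical values)
toy <- data.frame(
  mass    = c(3700, 3800, 3600, 3750, 3850, 5000, 5200, 5100),
  species = factor(c("Adelie", "Adelie", "Adelie",
                     "Chinstrap", "Chinstrap",
                     "Gentoo", "Gentoo", "Gentoo"))
)

fit <- lm(mass ~ species, data = toy)

# Group means computed directly
group_means <- tapply(toy$mass, toy$species, mean)

coef(fit)                            # intercept and differences from Adelie
group_means - group_means["Adelie"]  # differences in group means
```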

Strength and Correlation

Correlation

Correlation is a statistic that describes the strength of the linear relationship between 2 continuous variables.

\[ r = \frac{1}{n-1}\sum^n_{i=1}\frac{x_i - \bar x}{s_x}\frac{y_i - \bar y}{s_y} \]

\(\bar x\), \(\bar y\): sample means
\(s_x\), \(s_y\): sample standard deviations

\[ -1 \leq r \leq 1 \]
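A short sketch that evaluates the formula above term by term on made-up data and compares the result with R's built-in cor().

```r
# Toy data (hypothetical): flipper length in mm and body mass in g
x <- c(181, 186, 195, 193, 190, 210, 215)
y <- c(3750, 3800, 3250, 3450, 3650, 4500, 5000)
n <- length(x)

# Standardize each variable, multiply pairwise, sum, divide by n - 1
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)
r_manual <- sum(z_x * z_y) / (n - 1)

c(manual = r_manual, cor = cor(x, y))
```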

Correlations

From IMS 2e

Coefficient of Determination

The coefficient of determination evaluates how well the linear model, which includes \(X\), explains the outcome \(Y\). With a single continuous predictor \(X\), it equals the squared correlation:

\[ R^2 = r^2 \]

\[ 0 \leq R^2 \leq 1 \]

The coefficient of determination measures the proportion of the total variation in \(Y\) that is explained by the linear model. The closer it is to 1, the better the linear model explains the outcome.
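One common way to compute this proportion is \(1 - SSE/SST\), where \(SST\) is the SSE of the intercept-only model. This formula is the standard definition rather than something stated in the slides. The sketch below checks it on made-up data against the value stored by summary() and against the squared correlation.

```r
# Toy data (hypothetical): flipper length in mm and body mass in g
x <- c(181, 186, 195, 193, 190, 210, 215)
y <- c(3750, 3800, 3250, 3450, 3650, 4500, 5000)

fit <- lm(y ~ x)

sse <- sum(residuals(fit)^2)   # unexplained variation under the linear model
sst <- sum((y - mean(y))^2)    # total variation (SSE of the intercept-only model)
r2_manual <- 1 - sse / sst

c(manual    = r2_manual,
  summary   = summary(fit)$r.squared,
  r_squared = cor(x, y)^2)
```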

Correlation in R

cor(DATA$Y, DATA$X)

X: Name of the predictor variable of interest in the data frame DATA
Y: Name of the outcome variable of interest in the data frame DATA
DATA: Name of the data frame

Example

cor(penguins$body_mass_g, penguins$flipper_length_mm)

#> [1] 0.8729789

Coefficient of Determination in R

xlm <- lm(Y ~ X, data = DATA)
r2(xlm)

X: Name of the predictor variable of interest in the data frame DATA
Y: Name of the outcome variable of interest in the data frame DATA
DATA: Name of the data frame

Example

xlm <- lm(body_mass_g ~ species, penguins)
r2(xlm)

#> [1] 0.6744887

Prediction

Statistical Model

\[ \hat Y = \hat \beta_0 + \hat \beta_1 X \]

\(X\): Input
\(\hat Y\): Output

Prediction

Using the equation for \(\hat Y\), we can supply a value of \(X\) and obtain, in return, a value of \(\hat Y\) that predicts the true value \(Y\).

Prediction in R

xlm <- lm(Y ~ X,
            data = DATA)

predict_df <- data.frame(X = VAL)

predict(xlm,
        predict_df)

X: Name of the predictor variable of interest in the data frame DATA
Y: Name of the outcome variable of interest in the data frame DATA
DATA: Name of the data frame
VAL: Value for the predictor variable

Example 1

Predict the body mass for a Gentoo penguin.

xlm <- lm(body_mass_g ~ species,
            data = penguins)

xlm

#> 
#> Call:
#> lm(formula = body_mass_g ~ species, data = penguins)
#> 
#> Coefficients:
#>      (Intercept)  speciesChinstrap     speciesGentoo  
#>          3706.16             26.92           1386.27

predict_df <- data.frame(species = "Gentoo")

predict(xlm,
        predict_df)

#>        1 
#> 5092.437

Example 2

Predict the body mass for a penguin with a flipper length of 190 mm.

xlm <- lm(body_mass_g ~ flipper_length_mm,
            data = penguins)


xlm

#> 
#> Call:
#> lm(formula = body_mass_g ~ flipper_length_mm, data = penguins)
#> 
#> Coefficients:
#>       (Intercept)  flipper_length_mm  
#>          -5872.09              50.15

predict_df <- data.frame(flipper_length_mm = 190)

predict(xlm,
        predict_df)

#>        1 
#> 3657.028
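The same prediction can be worked by hand from the printed coefficients:

\[ \hat Y = -5872.09 + 50.15 (190) = 3656.41 \]

The hand calculation differs slightly from the 3657.028 returned by predict() because the printed coefficients are rounded, while predict() uses the full-precision estimates.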

Interpolation

Interpolation is the process of estimating a value within the range of the observed input data \(X\).

Extrapolation

Extrapolation is the process of estimating a value beyond the range of observed input data \(X\). It’s about venturing into the unknown, using what we know as a guide.
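A small sketch of how one might flag extrapolation before predicting: compare each new input with the range of the observed inputs. The helper name `is_extrapolation` and the data are hypothetical.

```r
# Toy observed inputs (hypothetical flipper lengths in mm)
x <- c(181, 186, 195, 193, 190, 210, 215)

# TRUE when a new input falls outside the range of the observed inputs
is_extrapolation <- function(new_x, observed_x) {
  new_x < min(observed_x) | new_x > max(observed_x)
}

# 190 is inside the observed range; 250 is beyond it
is_extrapolation(c(190, 250), x)
```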

Extrapolation

ggplot(penguins, aes(flipper_length_mm, body_mass_g)) + 
  xlim(160, 250) +
  ylim(2600, 7000) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)