Modeling Relationships
A Simple Model
Modelling Data
Linear Model
Strength and Correlation
Prediction
Modeling relationships is the process of reducing the unexplained variation in an outcome by using other variables.
It can be thought of as being less wrong when making an educated guess.
\[ Y \sim DGP_1 \]
\[ Y = \_\_\_ + error \]
\[ Y = \ \ \ \ \ \ \ \ \ + \varepsilon \]
\[ Y \sim \beta_0 + \varepsilon \]
\[ \varepsilon \sim DGP_2 \]
\(DGP_2\) is not the same as \(DGP_1\); it is \(DGP_1\) shifted by \(\beta_0\). Consider this the null \(DGP\).
\[ Y = \beta_0 + \varepsilon \]
\[ \hat Y=\hat\beta_0 \]
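To make the relationship between the two data generating processes concrete, here is a minimal simulation sketch. It assumes, purely for illustration, that \(DGP_1\) is a normal distribution with a hypothetical mean of 4200 and standard deviation of 800; the slides do not specify the form of either DGP.

```r
set.seed(1)  # for reproducibility

# Assumption: DGP_1 is normal; beta0 and sd are hypothetical values
beta0 <- 4200
y <- rnorm(1000, mean = beta0, sd = 800)

# Errors under the simple model: epsilon = Y - beta_0
eps <- y - beta0

# Same spread, different center: the errors are Y shifted by beta_0
c(mean_y = mean(y), mean_eps = mean(eps))
c(sd_y = sd(y), sd_eps = sd(eps))
```

Under these assumptions the errors keep the spread of \(Y\) but are centered near 0 instead of near \(\beta_0\).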
Each observation (row) in a data set can be indexed by a number.
#> # A tibble: 1 × 6
#>   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex    year
#>            <dbl>         <dbl>             <int>       <int> <fct> <int>
#> 1           39.1          18.7               181        3750 male   2007
Let \(Y\) represent the variable “body_mass_g” and \(X\) represent “flipper_length_mm”. For the first observation:
\[ Y_1 = 3750, \quad X_1 = 181 \]
\[ Y_i, X_i \]
With the data that we collect from a sample, we hypothesize how the data was generated.
Using a simple model:
\[ Y_i = \beta_0 + \varepsilon_i \]
\[ \hat Y_i = \hat \beta_0 \]
To obtain the estimate \(\hat \beta_0\), we minimize the following function:
\[ \sum^n_{i=1} (Y_i-\hat Y_i)^2 \]
This is known as the sum of squared errors (SSE).
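A short derivation shows why, for the simple model, the minimizer has a closed form. Substituting \(\hat Y_i = \hat \beta_0\) into the sum and setting the derivative with respect to \(\hat \beta_0\) to zero:
\[ \frac{d}{d\hat \beta_0} \sum^n_{i=1} (Y_i-\hat \beta_0)^2 = -2 \sum^n_{i=1} (Y_i-\hat \beta_0) = 0 \quad \Rightarrow \quad \hat \beta_0 = \frac{1}{n}\sum^n_{i=1} Y_i = \bar Y \]
So, under the simple model, the least-squares estimate is the sample mean of the outcome.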
The residuals are the observed errors: the differences between the observed data and the model's fitted values.
\[ r_i = Y_i - \hat Y_i \]
Y: name of the outcome variable of interest in data frame DATA
DATA: name of the data frame
#>
#> Call:
#> lm(formula = body_mass_g ~ 1, data = penguins)
#>
#> Coefficients:
#> (Intercept)
#> 4207
\[ \hat Y = 4207 \]
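The fit above can be sketched in base R as follows. The data frame `toy` is a small hypothetical data set made up for illustration, not the penguins data used in the slides, so its estimate will differ from 4207. The sketch also computes the residuals and the SSE by hand.

```r
# Hypothetical toy data (illustrative values, not the slides' penguins data)
toy <- data.frame(
  flipper_length_mm = c(181, 186, 195, 193, 190, 210, 215, 220),
  body_mass_g       = c(3750, 3800, 3250, 3450, 3650, 4500, 5000, 5400)
)

# Intercept-only model: Y_i = beta_0 + epsilon_i
fit0 <- lm(body_mass_g ~ 1, data = toy)
beta0_hat <- unname(coef(fit0)["(Intercept)"])

# Residuals r_i = Y_i - Y_hat_i and the sum of squared errors
r <- toy$body_mass_g - fitted(fit0)
sse0 <- sum(r^2)

# The least-squares estimate of beta_0 matches the sample mean of Y
all.equal(beta0_hat, mean(toy$body_mass_g))
```

The final comparison reflects that, for the simple model, the value minimizing the SSE is the sample mean of the outcome.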
A goal of statistics is to develop models that better explain the outcome \(Y\).
In particular, we want to reduce the sum of squared errors.
By using a bit more information, \(X\), we can improve the predictive capability of the model.
Thus, the linear model is born.
\[ Y = \beta_0 + \beta_1 X + \varepsilon \]
\[ \varepsilon \sim DGP_3 \]
\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \]
\[ \hat Y_i = \hat \beta_0 + \hat \beta_1 X_i \]
The goal is to obtain numerical values for \(\hat \beta_0\) and \(\hat \beta_1\) that minimize the SSE.
\[ \sum^n_{i=1} (Y_i-\hat Y_i)^2 \]
\[ \hat Y_i = \hat \beta_0 + \hat \beta_1 X_i \]
X: name of the predictor variable of interest in data frame DATA
Y: name of the outcome variable of interest in data frame DATA
DATA: name of the data frame
Y: “body_mass_g”; X: “flipper_length_mm”
#>
#> Call:
#> lm(formula = body_mass_g ~ flipper_length_mm, data = penguins)
#>
#> Coefficients:
#> (Intercept) flipper_length_mm
#> -5872.09 50.15
\[ \hat Y_i = -5872.09 + 50.15 X_i \]
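A minimal sketch of fitting the linear model in base R, and of checking that it reduces the SSE relative to the intercept-only model, is shown below. The data frame `toy` and the helper `sse()` are hypothetical and exist only for illustration; the coefficients will not match the slide output. The closed-form expressions are the standard least-squares formulas for one predictor.

```r
# Hypothetical toy data (illustrative values, not the slides' penguins data)
toy <- data.frame(
  flipper_length_mm = c(181, 186, 195, 193, 190, 210, 215, 220),
  body_mass_g       = c(3750, 3800, 3250, 3450, 3650, 4500, 5000, 5400)
)

fit0 <- lm(body_mass_g ~ 1, data = toy)                  # simple model
fit1 <- lm(body_mass_g ~ flipper_length_mm, data = toy)  # linear model

# Hypothetical helper: sum of squared errors of a fitted model
sse <- function(fit) sum(residuals(fit)^2)
c(intercept_only = sse(fit0), linear = sse(fit1))

# Closed-form least-squares estimates, for comparison with lm()
x <- toy$flipper_length_mm
y <- toy$body_mass_g
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
all.equal(unname(coef(fit1)), c(b0, b1))
```

Adding a predictor can never increase the SSE on the data used for fitting, which is why the linear model is at least as good as the simple model by this measure.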
The intercept \(\hat \beta_0\) can be interpreted as the predicted value of \(Y\) when \(X\) is 0.
Sometimes the intercept has a meaningful real-world interpretation.
Other times it does not.
\[ \hat Y_i = -5872.09 + 50.15 X_i \]
When flipper length is 0 mm, the predicted body mass is -5872.09 grams. A negative mass is impossible, so here the intercept has no real-world interpretation.
The slope \(\hat \beta_1\) indicates how \(Y\) changes, on average, when \(X\) increases by 1 unit.
Its sign shows whether the relationship is, on average, positive or negative.
\[ \hat Y_i = -5872.09 + 50.15 X_i \]
When flipper length increases by 1 mm, the body mass increases by 50.15 grams, on average.
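As a quick check of this interpretation, compare the predictions from the fitted equation at two flipper lengths 1 mm apart, for example 190 mm and 191 mm:
\[ \hat Y(191) - \hat Y(190) = (-5872.09 + 50.15 \cdot 191) - (-5872.09 + 50.15 \cdot 190) = 50.15 \]
The intercept cancels, so the difference equals the slope regardless of the starting value of \(X\).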
Correlation is a statistic that describes the strength of the linear relationship between two continuous variables.
\[ r = \frac{1}{n-1}\sum^n_{i=1}\frac{x_i - \bar x}{s_x}\frac{y_i - \bar y}{s_y} \]
\[ -1 \leq r \leq 1 \]
From IMS 2e
The coefficient of determination evaluates the strength of the relationship between an outcome \(Y\) and the linear model that includes \(X\).
\[ R^2 = r^2 \]
\[ 0 \leq R^2 \leq 1 \]
The coefficient of determination measures the proportion of the total variation in \(Y\) that is explained by the linear model. The closer it is to 1, the better the linear model fits the data.
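The two quantities can be computed in base R as sketched below, again on a hypothetical toy data frame rather than the slides' penguins data. The sketch checks that \(R^2 = r^2\) for a model with one predictor, and that \(R^2\) equals one minus the ratio of the SSE to the total sum of squares.

```r
# Hypothetical toy data (illustrative values, not the slides' penguins data)
toy <- data.frame(
  flipper_length_mm = c(181, 186, 195, 193, 190, 210, 215, 220),
  body_mass_g       = c(3750, 3800, 3250, 3450, 3650, 4500, 5000, 5400)
)

# Correlation between the predictor and the outcome
r <- cor(toy$flipper_length_mm, toy$body_mass_g)

# Coefficient of determination from the fitted linear model
fit1 <- lm(body_mass_g ~ flipper_length_mm, data = toy)
r_squared <- summary(fit1)$r.squared

# With a single predictor, R^2 equals the squared correlation
all.equal(r^2, r_squared)

# R^2 as the share of total variation explained: 1 - SSE / SST
sse <- sum(residuals(fit1)^2)
sst <- sum((toy$body_mass_g - mean(toy$body_mass_g))^2)
all.equal(1 - sse / sst, r_squared)
```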
X: name of the predictor variable of interest in data frame DATA
Y: name of the outcome variable of interest in data frame DATA
DATA: name of the data frame
\[ \hat Y = \hat \beta_0 + \hat \beta_1 X \]
Using the fitted equation, we can plug in a value of \(X\) and obtain a value of \(\hat Y\) that predicts the true value \(Y\).
X: name of the predictor variable of interest in data frame DATA
Y: name of the outcome variable of interest in data frame DATA
DATA: name of the data frame
VAL: value for the predictor variable
Predict the body mass for a gentoo penguin.
Predict the body mass for a penguin with a flipper length of 190 mm.
#>
#> Call:
#> lm(formula = body_mass_g ~ flipper_length_mm, data = penguins)
#>
#> Coefficients:
#> (Intercept) flipper_length_mm
#> -5872.09 50.15
#> 1
#> 3657.028
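A prediction like the one above can be sketched in base R with `predict()` and a one-row data frame holding the new predictor value. The data frame `toy` is hypothetical, so the predicted value will differ from the slide output; the sketch also confirms that `predict()` agrees with plugging the value into the fitted equation.

```r
# Hypothetical toy data (illustrative values, not the slides' penguins data)
toy <- data.frame(
  flipper_length_mm = c(181, 186, 195, 193, 190, 210, 215, 220),
  body_mass_g       = c(3750, 3800, 3250, 3450, 3650, 4500, 5000, 5400)
)

fit1 <- lm(body_mass_g ~ flipper_length_mm, data = toy)

# New observation: the column name must match the predictor in the model
new_obs <- data.frame(flipper_length_mm = 190)
pred <- predict(fit1, newdata = new_obs)

# Same prediction by hand: beta0_hat + beta1_hat * 190
manual <- coef(fit1)[["(Intercept)"]] + coef(fit1)[["flipper_length_mm"]] * 190
all.equal(unname(pred), manual)
```

Note that plugging the rounded coefficients from the slide into the equation by hand gives \(-5872.09 + 50.15 \cdot 190 = 3656.41\), slightly different from the 3657.028 that R reports, presumably because R uses the unrounded coefficients.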
Interpolation is the process of estimating a value within the range of the observed input data \(X\).
Extrapolation is the process of estimating a value beyond the range of observed input data \(X\). It’s about venturing into the unknown, using what we know as a guide.
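One simple way to tell the two cases apart is to compare the new predictor value with the observed range of \(X\). The sketch below does this with a hypothetical helper, `is_interpolation()`, on a hypothetical toy predictor; neither comes from the slides.

```r
# Hypothetical observed predictor values (illustrative, not the slides' data)
flipper_length_mm <- c(181, 186, 195, 193, 190, 210, 215, 220)

# Observed range of X: smallest and largest values
x_range <- range(flipper_length_mm)

# Hypothetical helper: TRUE when val lies inside the observed range
is_interpolation <- function(val, rng) {
  val >= rng[1] & val <= rng[2]
}

is_interpolation(190, x_range)  # inside the observed range: interpolation
is_interpolation(0, x_range)    # outside the observed range: extrapolation
```

Predicting at a flipper length of 0 mm, as in the interpretation of the intercept, is an extrapolation, which is one reason that value is not trustworthy.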
