XLM <- glm(Y ~ X1 + X2 + ... + Xp, data = DATA, family = binomial())
tidy(XLM)
XLM: Object where the model is stored
Y: Name of the outcome variable in DATA
X1, X2, …, Xp: Names of the predictor variables in DATA
DATA: Name of the data set
Examples
Penguins Example
Is there a significant relationship between penguin body mass (outcome; body_mass) and flipper length (predictor; flipper_len), adjusting for species? Use the penguins data set to determine whether there is a significant association.
Penguins: Hypothesis
\(H_0\): There is no relationship between penguin body mass and flipper length, adjusting for penguin species (\(\beta_{flipper\_len} = 0\))
\(H_1\): There is a relationship between penguin body mass and flipper length, adjusting for penguin species (\(\beta_{flipper\_len} \ne 0\))
There is a significant association between penguin flipper length and body mass, after adjusting for species (p < 0.0001; \(\beta = 40.7\)). As flipper length increases by 1 unit, body mass increases by 40.7 units, adjusting for penguin species.
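Penguins: Code
A minimal sketch of the corresponding model fit, assuming the penguins data set uses the variable names above (body_mass, flipper_len, species); because body mass is continuous, a linear model is used rather than a logistic one:
Code
library(broom)  # provides tidy()
# Linear regression: body mass on flipper length, adjusting for species
m1 <- lm(body_mass ~ flipper_len + species, data = penguins)
tidy(m1)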
Heart Disease Example
Is there a significant association between heart disease (outcome; disease) and resting blood pressure (predictor; trestbps), adjusting for chest pain (cp)? Use the heart_disease data set to determine whether there is a significant association.
Heart: Hypothesis
\(H_0\): There is no relationship between heart disease probability and resting blood pressure, adjusting for chest pain (\(\beta_{bp} = 0\))
\(H_1\): There is a relationship between heart disease probability and resting blood pressure, adjusting for chest pain (\(\beta_{bp} \ne 0\))
Heart: \(\alpha\)-level
\[
\alpha = 0.05 = 5.0 \times 10^{-2} = \text{5.0e-2}
\]
Heart: Code
Code
m2 <- glm(disease ~ trestbps + cp, data = heart_disease, family = binomial())
tidy(m2)
There is a significant association between heart disease and resting blood pressure, after adjusting for chest pain (p = 0.00967; \(\beta = 0.0209\)). As resting blood pressure increases by 1 unit, the odds of having heart disease increase by a factor of 1.02, adjusting for chest pain.
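The factor of 1.02 is the exponentiated coefficient, which can be checked directly:
Code
exp(0.0209)  # odds ratio per 1-unit increase in resting blood pressure
# approximately 1.0211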
Power Analysis
What is Statistical Power?
Statistical Power is the probability of correctly rejecting a false null hypothesis.
In other words, it’s the chance of detecting a real effect when it exists.
Why Power Matters
Low power → high risk of Type II Error (false negatives)
High power → better chance of finding true effects
Common threshold: 80% power
Errors in Inference
Type I: Reject \(H_0\) when it is true (false positive)
Type II: Fail to reject \(H_0\) when it is false (false negative)
Power: \(1 - P(\text{Type II})\); the probability of detecting a true effect
Type I Error (False Positive)
Rejecting \(H_0\) when it is actually true
Probability = \(\alpha\) (significance level)
Type II Error (False Negative)
Failing to reject \(H_0\) when it is actually false
Probability = \(\beta\)
Power = \(1 - \beta\)
Balancing Errors
Lowering \(\alpha\) reduces Type I errors, but increases risk of Type II errors.
To reduce both:
Increase sample size
Use more appropriate statistical tests
What Affects Power?
Effect Size
Bigger effects are easier to detect
Sample Size (\(n\))
Larger samples reduce standard error
Significance Level (\(\alpha\))
Higher \(\alpha\) increases power (but riskier!)
Variability
Less noise in data = better power
Boosting Power
Power = Probability of rejecting \(H_0\) when it’s false
Helps avoid Type II Errors
Driven by:
Sample size
Effect size
\(\alpha\)
Variability
Aim for 80% or higher
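These drivers can be made concrete with a small simulation (a sketch, assuming a simple linear model, not code from the slides): generate data under a known effect size, refit the model many times, and record how often \(H_0\) is rejected.
Code
# Simulation-based power estimate for a simple linear regression
set.seed(42)
power_sim <- function(n, beta1, sigma = 1, alpha = 0.05, reps = 1000) {
  rejections <- replicate(reps, {
    x <- rnorm(n)
    y <- beta1 * x + rnorm(n, sd = sigma)
    # p-value for the slope from the fitted model
    summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"] < alpha
  })
  mean(rejections)  # proportion of simulations that reject H0
}
power_sim(n = 50, beta1 = 0.3)   # smaller sample, lower power
power_sim(n = 200, beta1 = 0.3)  # same effect, larger n, higher power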
Model Conditions
When conducting inference with regression models, we must check the following conditions:
Linearity
Independence
Probability Assumption
Equal Variances
Multicollinearity (for Multi-Regression)
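Multicollinearity is commonly checked with variance inflation factors; a minimal sketch, assuming the car package and the built-in mtcars data (neither appears in these slides):
Code
library(car)  # assumed: provides vif()
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
vif(fit)  # VIF values well above 5-10 suggest problematic multicollinearity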
Linearity
There must be a linear relationship between the outcome variable (\(y\)) and the set of predictors (\(x_1\), \(x_2\), …).
Independence
The data points must not influence each other.
Probability Assumption
The model errors (also known as residuals) must follow a specified distribution.
Linear Regression: Normal Distribution
Logistic Regression: Binomial Distribution
Equal Variances
The variability of the data points must be the same for all predictor values.
Residuals
Residuals are the errors between the observed value and the estimated model. Common residuals include
Raw Residual
Standardized Residuals
Jackknife (studentized) Residuals
Deviance Residuals
Quantile Residuals
Influential Measurements
Influential measures are statistics that quantify how much a single data point affects the fitted model. Common influential measures are
Leverages
Cook’s Distance
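Both measures are available in base R for a fitted model; a quick sketch using the built-in mtcars data as a stand-in (not from the slides):
Code
fit <- lm(mpg ~ wt, data = mtcars)
head(hatvalues(fit))       # leverages: diagonal of the hat matrix
head(cooks.distance(fit))  # Cook's distance for each observation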
Raw Residuals
\[
\hat r_i = y_i - \hat y_i
\]
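In R, raw residuals come from resid(); a minimal check of this definition, again using mtcars as a stand-in:
Code
fit <- lm(mpg ~ wt, data = mtcars)
raw <- mtcars$mpg - fitted(fit)             # y_i minus y_hat_i
all.equal(unname(resid(fit)), unname(raw))  # TRUE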
Residual Analysis
A residual analysis is used to check the assumptions of a regression model.
QQ Plot
A QQ (quantile-quantile) plot displays the estimated quantiles of the residuals against the theoretical quantiles of a normal distribution. If the points lie along the \(y = x\) line, the residuals are said to follow a normal distribution.
Residual vs Fitted Plot
This plot allows you to assess the linearity, constant variance, and identify potential outliers. Create a scatter plot between the fitted values (x-axis) and the raw/standardized residuals (y-axis).
Residual Analysis in R
Use the resid_df function to obtain the residuals of a model.
Code
rdf <- resid_df(OBJECT)
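The QQ plot can then be drawn with ggplot2's quantile geoms (a sketch; the resid column name matches the residual vs fitted code below):
Code
ggplot(RDF, aes(sample = resid)) +
  geom_qq() +
  geom_qq_line(col = "red")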
Residual vs Fitted Plot
Linear
ggplot(RDF, aes(fitted, resid)) +
  geom_point() +
  geom_hline(yintercept = 0, col = "red")
Logistic
ggplot(RDF, aes(fitted, quantile_resid)) +
  geom_point() +
  geom_hline(yintercept = 0, col = "red")