---
title: "Logistic Regression Models"
description: |
  The logistic regression model, interpreting coefficients (odds & odds ratios), fitting
  models in R, and predicting probabilities.
format:
  html:
    toc: true
    toc-depth: 3
    number-sections: true
    code-tools: true
    code-fold: false
    smooth-scroll: true
editor: source
image: img/logistic.png
execute:
  echo: true
  warning: false
  message: false
  error: true
jupyter: r
knitr:
  opts_chunk:
    comment: "#>"
---
```{r}
#| label: setup
#| include: false
library(tidyverse)
library(rcistats)
library(MASS)
# Outcome: 1 = died from Melanoma, 0 = did not
Melanoma$dead <- ifelse(Melanoma$status == 1, 1, 0)
```
# Google Colab
Copy the following code and put it in a code cell in Google Colab. Only do this if you are using a completely new notebook.
```r
# This code will load the R packages we will use
install.packages(c("rcistats"),
                 repos = c("https://inqs909.r-universe.dev",
                           "https://cloud.r-project.org"))
library(tidyverse)
library(rcistats)  # the package installed above
library(MASS)
# Uncomment and run for themes
# csucistats::install_themes()
# library(ThemePark)
# library(ggthemes)
# Outcome: 1 = died from Melanoma, 0 = did not
Melanoma$dead <- ifelse(Melanoma$status == 1, 1, 0)
```
# Data
## Melanoma
Melanoma is a type of skin cancer arising from melanin‑producing cells. It is dangerous because it can metastasize to other parts of the body.
## Outcome of interest
We want to understand **how predictors affect survival** during a study period. We therefore code a **binary outcome** `dead`.
## Data
We use `MASS::Melanoma` with:

- `dead` (1 = died of melanoma, 0 = otherwise),
- `sex` (1 = male, 0 = female),
- `age` (years),
- `thickness` (tumour thickness in mm),
- `ulcer` (1 = present, 0 = absent).
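
A quick check of the outcome coding can be sketched as follows. It assumes, per the `MASS` documentation, that `status` is coded 1 = died from melanoma, 2 = alive, 3 = died from other causes. The name `dat` is only an illustrative copy of the data, not something used elsewhere in these notes.

```r
library(MASS)  # ships with R; provides the Melanoma data frame

dat <- Melanoma  # illustrative copy
# Same coding as the setup code: 1 = died from melanoma, 0 = otherwise
dat$dead <- ifelse(dat$status == 1, 1, 0)

# Cross-tabulate the original status codes against the new binary outcome
table(status = dat$status, dead = dat$dead)

# Proportion of patients who died from melanoma during the study
mean(dat$dead)
```

Every row with `status == 1` should fall in the `dead = 1` column, and all other rows in `dead = 0`.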
## Plot
```{r}
#| code-fold: true
ggplot(Melanoma, aes(thickness, dead)) +
  geom_point(alpha = 0.7) +
  labs(x = "Tumour thickness (mm)", y = "Dead (1=yes, 0=no)") +
  stat_smooth(method = "glm",
              se = FALSE,
              method.args = list(family = "binomial"),
              color = "blue") +
  theme_bw()
```
# Logistic regression in R
**Logistic regression** models the relationship between predictors and a **binary** outcome by making the **log‑odds** a linear function of the predictors.
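
To make the three scales concrete, here is a small sketch converting between probability, odds, and log-odds. The probabilities are made-up illustrative values, not estimates from the Melanoma data.

```r
p <- c(0.10, 0.50, 0.80)   # illustrative probabilities (not from the data)

odds     <- p / (1 - p)    # odds = P(event) / P(no event)
log_odds <- log(odds)      # the scale on which the model is linear

data.frame(p, odds, log_odds)

# plogis() is the inverse of the logit: it maps log-odds back to probabilities
plogis(log_odds)
```

A probability of 0.5 corresponds to odds of 1 and log-odds of 0; probabilities above 0.5 give positive log-odds, and those below give negative log-odds.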
## Fitting Model
**Template:**
```r
# Logistic regression (binomial GLM)
glm(Y ~ X1 + X2 + ... + Xp,
    data = DATA,
    family = binomial())
```
**Example (Melanoma):**
Model `dead` by `sex`, `age`, `thickness`, and `ulcer`:
```{r}
glm(dead ~ sex + age + thickness + ulcer,
    data = Melanoma,
    family = binomial())
```
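The fitted logit equation can be written generically as
$$
\log\left(\frac{\hat P(Y=1)}{1 - \hat P(Y=1)}\right) = \hat\beta_0 + \hat\beta_1 X_1 + \cdots + \hat\beta_p X_p.
$$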
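
On the log-odds (link) scale, the fitted model is an intercept plus a weighted sum of the predictors. The sketch below checks this by rebuilding the log-odds by hand and comparing them with what `predict()` reports. The names `dat`, `fit`, `b`, and `X` are illustrative and not used elsewhere in these notes.

```r
library(MASS)

dat <- Melanoma
dat$dead <- ifelse(dat$status == 1, 1, 0)

fit <- glm(dead ~ sex + age + thickness + ulcer,
           data = dat, family = binomial())

b <- coef(fit)           # estimated intercept and slopes
X <- model.matrix(fit)   # column of 1s plus the four predictors

# Log-odds by hand: intercept + slope * predictor, summed over predictors
eta_manual <- drop(X %*% b)

# Log-odds as reported by R
eta_r <- predict(fit, type = "link")

all.equal(unname(eta_manual), unname(eta_r))
```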
## Odds ratios (exponentiated coefficients)
**Template:**
```r
# Logistic regression (binomial GLM)
m <- glm(Y ~ X1 + X2 + ... + Xp,
         data = DATA,
         family = binomial())
exp(coef(m))
```
**Example:**
```{r}
m <- glm(dead ~ sex + age + thickness + ulcer,
         data = Melanoma,
         family = binomial())
exp(coef(m))
```
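
One way to read an odds ratio is as a multiplicative change in the odds for a one-unit increase in a predictor, holding the others fixed. The sketch below expresses each odds ratio as a percent change and then checks the interpretation for `thickness` using two hypothetical patients who differ only by 1 mm. The patient values (male, age 60, ulcer present, 2 mm vs 3 mm) and the names `dat`, `fit`, `odds_ratio`, and `pair` are assumptions chosen for illustration.

```r
library(MASS)

dat <- Melanoma
dat$dead <- ifelse(dat$status == 1, 1, 0)

fit <- glm(dead ~ sex + age + thickness + ulcer,
           data = dat, family = binomial())

odds_ratio <- exp(coef(fit))

# Percent change in the odds of death per one-unit increase in each predictor
# (the intercept is dropped because it is not an odds ratio)
round(100 * (odds_ratio[-1] - 1), 1)

# Two hypothetical patients who differ only by 1 mm of tumour thickness
pair <- data.frame(sex = 1, age = 60, thickness = c(2, 3), ulcer = 1)
p    <- predict(fit, newdata = pair, type = "response")
odds <- p / (1 - p)

unname(odds[2] / odds[1])          # ratio of the two predicted odds
unname(odds_ratio["thickness"])    # exp(coefficient) for thickness
```

The last two numbers should agree: the ratio of the predicted odds equals the exponentiated `thickness` coefficient, whatever values the other predictors are held at.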
# Prediction with models
## Model
$$
\hat P(Y=1) = \frac{e^{\hat\beta_0 + \hat\beta_1 X_1 + \cdots + \hat\beta_p X_p}}{1 + e^{\hat\beta_0 + \hat\beta_1 X_1 + \cdots + \hat\beta_p X_p}}.
$$
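
As a sanity check, the formula can be applied by hand and compared with `predict()`. This sketch uses one hypothetical patient (male, age 75, thickness 2.9 mm, ulcer present). The names `dat`, `fit`, `x`, and `eta` are illustrative.

```r
library(MASS)

dat <- Melanoma
dat$dead <- ifelse(dat$status == 1, 1, 0)

fit <- glm(dead ~ sex + age + thickness + ulcer,
           data = dat, family = binomial())

# Hypothetical patient; the leading 1 multiplies the intercept
x <- c("(Intercept)" = 1, sex = 1, age = 75, thickness = 2.9, ulcer = 1)

# Linear predictor: intercept plus each coefficient times its predictor value
b   <- coef(fit)
eta <- sum(b * x[names(b)])

# Plug the linear predictor into the probability formula
p_manual <- exp(eta) / (1 + exp(eta))

# Same patient through predict()
new_patient <- data.frame(sex = 1, age = 75, thickness = 2.9, ulcer = 1)
p_r <- predict(fit, newdata = new_patient, type = "response")

c(manual = p_manual, predict = unname(p_r))
```

Both entries should be identical, which confirms that `type = "response"` returns the probability given by the formula.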
**Template:**
```r
xglm <- glm(Y ~ X1 + X2 + ... + Xp,
            data = DATA,
            family = binomial())
new_df <- data.frame(X1 = VAL1, X2 = VAL2, ..., Xp = VALp)
predict(xglm, newdata = new_df, type = "response") # probabilities
```
**Examples (Melanoma):**
0) **Fit a model with sex, age, thickness, and ulcer**
```{r}
xglm <- glm(dead ~ sex + age + thickness + ulcer,
            data = Melanoma,
            family = binomial())
```
1) **Male, age 75, thickness 2.9, ulcer present**
```{r}
new1 <- data.frame(sex = 1, age = 75, thickness = 2.9, ulcer = 1)
predict(xglm, new1, type = "response")
```
2) **Male, age 75, thickness 2.9, ulcer absent**
```{r}
new2 <- data.frame(sex = 1, age = 75, thickness = 2.9, ulcer = 0)
predict(xglm, new2, type = "response")
```
3) **Female, thickness 2.9, ulcer present — compare ages 55 vs 75**
```{r}
new3 <- tibble(sex = 0, age = c(55, 75), thickness = 2.9, ulcer = 1)
pred3 <- predict(xglm, new3, type = "response")
tibble(age = new3$age, probability = scales::percent(pred3, accuracy = 0.1))
```
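
When many patient profiles are of interest, they can be predicted in a single call. The sketch below uses `expand.grid()` to build every combination of a few illustrative values. The chosen values (age 60, thickness 1, 3, and 5 mm) and the names `dat`, `fit`, and `grid` are assumptions for the example.

```r
library(MASS)

dat <- Melanoma
dat$dead <- ifelse(dat$status == 1, 1, 0)

fit <- glm(dead ~ sex + age + thickness + ulcer,
           data = dat, family = binomial())

# All combinations: 2 sexes x 1 age x 3 thicknesses x 2 ulcer states = 12 rows
grid <- expand.grid(sex = c(0, 1),
                    age = 60,
                    thickness = c(1, 3, 5),
                    ulcer = c(0, 1))

# One predicted probability of death per profile
grid$probability <- predict(fit, newdata = grid, type = "response")
grid
```

The column names in `grid` must match the predictor names used in the model formula, otherwise `predict()` cannot find them.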
# Appendix: quick reference (copy‑paste)
## Simple Logistic Regression
```r
# Fit
glm(Y ~ X, data = DATA, family = binomial())
```
- **`DATA`** → your data frame (e.g., `Melanoma`)
- **`Y`** → the outcome variable (e.g., `dead`)
- **`X`** → the predictor variable (e.g., `thickness`)
## Logistic Regression
```r
# Fit
glm(Y ~ X1 + X2 + ... + Xp, data = DATA, family = binomial())
```
- **`DATA`** → your data frame (e.g., `Melanoma`)
- **`Y`** → the outcome variable (e.g., `dead`)
- **`X1`, `X2`, ..., `Xp`** → predictor variables (e.g., `age`, `thickness`)
## Odds Ratio
```r
m <- glm(Y ~ X1 + X2 + ... + Xp, data = DATA, family = binomial())
# Odds Ratio
exp(coef(m))
```
- **`DATA`** → your data frame (e.g., `Melanoma`)
- **`Y`** → the outcome variable (e.g., `dead`)
- **`X1`, `X2`, ..., `Xp`** → predictor variables (e.g., `age`, `thickness`)
## Predict Probabilities
```r
m <- glm(Y ~ X1 + X2 + ... + Xp, data = DATA, family = binomial())
# Predict probabilities
new_df <- data.frame(X1 = VAL1, X2 = VAL2, ..., Xp = VALp)
predict(m, newdata = new_df, type = "response")
```
- **`DATA`** → your data frame (e.g., `Melanoma`)
- **`Y`** → the outcome variable (e.g., `dead`)
- **`X1`, `X2`, ..., `Xp`** → predictor variables (e.g., `age`, `thickness`)
- **`VAL1`, `VAL2`, ..., `VALp`** → predictor values (e.g., `55`, `2.8`)