Logistic Regression Models

The logistic regression model, interpreting coefficients (odds & odds ratios), fitting models in R, and predicting probabilities.

1 Google Colab

Copy the following code and put it in a code cell in Google Colab. Only do this if you are using a completely new notebook.

# This code will load the R packages we will use
install.packages(c("csucistats"),
                 repos = c("https://inqs909.r-universe.dev", 
                           "https://cloud.r-project.org"))
library(tidyverse)
library(csucistats)
library(MASS)


# Uncomment and run for themes
# csucistats::install_themes()
# library(ThemePark)
# library(ggthemes)

# Outcome: 1 = died from Melanoma, 0 = did not
Melanoma$dead <- ifelse(Melanoma$status == 1, 1, 0)

2 Data

2.1 Melanoma

Melanoma is a type of skin cancer arising from melanin‑producing cells. It is dangerous because it can metastasize to other parts of the body.

2.2 Outcome of interest

We want to understand how predictors affect survival during a study period. We therefore code a binary outcome dead.
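
The recode itself is in the setup chunk above; a quick sanity check on the coded outcome (a sketch, assuming that chunk has been run):

# Count non-deaths (0) and deaths from Melanoma (1)
table(Melanoma$dead)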

2.3 Data

We use MASS::Melanoma with:

  • dead (1 = died of Melanoma, 0 = otherwise)
  • sex (1 = male, 0 = female)
  • age (years)
  • thickness (tumour thickness in mm)
  • ulcer (1 = present, 0 = absent)
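
A quick look at the variables used below (a sketch; it assumes the setup chunk has been run so that dead exists):

head(Melanoma[, c("dead", "sex", "age", "thickness", "ulcer")])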

2.4 Plot

A scatterplot of dead against tumour thickness, with the fitted logistic curve overlaid:
ggplot(Melanoma, aes(thickness, dead)) +
  geom_point(alpha = 0.7) +
  labs(x = "Tumour thickness (mm)", y = "Dead (1=yes, 0=no)") +
  stat_smooth(method = "glm",
              se = FALSE,
              method.args = list(family = "binomial"),
              color = "blue") +
  theme_bw()

3 Logistic regression in R

Logistic regression models the relationship between predictors and a binary outcome by making the log‑odds a linear function of the predictors.
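
In symbols, for a binary outcome Y and predictors X1, …, Xp:

\[ \log\frac{P(Y=1)}{1-P(Y=1)} = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p. \]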

3.1 Fitting Model

Template:

# Logistic regression (binomial GLM)
glm(Y ~ X1 + X2 + ... + Xp,
    data = DATA,
    family = binomial())

Example (Melanoma):

Model dead by sex, age, thickness, and ulcer:

glm(dead ~ sex + age + thickness + ulcer,
    data = Melanoma,
    family = binomial())
#> 
#> Call:  glm(formula = dead ~ sex + age + thickness + ulcer, family = binomial(), 
#>     data = Melanoma)
#> 
#> Coefficients:
#> (Intercept)          sex          age    thickness        ulcer  
#>    -2.39860      0.40767      0.00402      0.11253      1.31314  
#> 
#> Degrees of Freedom: 204 Total (i.e. Null);  200 Residual
#> Null Deviance:       242.4 
#> Residual Deviance: 210.3     AIC: 220.3

The fitted model on the log-odds (logit) scale is

\[ \log\frac{\hat P(\text{dead}=1)}{1-\hat P(\text{dead}=1)} = -2.399 + 0.408\,\text{sex} + 0.004\,\text{age} + 0.113\,\text{thickness} + 1.313\,\text{ulcer}. \]

3.2 Odds ratios (exponentiated coefficients)

Template:

# Logistic regression (binomial GLM)
m <- glm(Y ~ X1 + X2 + ... + Xp,
         data = DATA,
         family = binomial())
exp(coef(m))

Example:

m <- glm(dead ~ sex + age + thickness + ulcer,
    data = Melanoma,
    family = binomial())
exp(coef(m))
#> (Intercept)         sex         age   thickness       ulcer 
#>  0.09084468  1.50331052  1.00402853  1.11910869  3.71781566
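
Each exponentiated coefficient is an odds ratio. For example, the value of about 3.72 for ulcer means the estimated odds of dying from Melanoma are roughly 3.7 times higher when ulceration is present, holding sex, age, and thickness fixed. As a sketch of why, the ratio of predicted odds at ulcer = 1 versus ulcer = 0 (other predictors held at arbitrary illustrative values) reproduces the exponentiated coefficient:

# Predicted probabilities at ulcer = 1 and ulcer = 0, other predictors fixed
# (sex = 1, age = 50, thickness = 3 are arbitrary; any fixed values give the same ratio)
p <- predict(m,
             newdata = data.frame(sex = 1, age = 50, thickness = 3, ulcer = c(1, 0)),
             type = "response")
odds <- p / (1 - p)
unname(odds[1] / odds[2])   # matches exp(coef(m))["ulcer"]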

4 Prediction with models

4.1 Model

\[ \hat P(Y=1) = \frac{e^{\hat\beta_0 + \hat\beta_1 X_1 + \cdots + \hat\beta_p X_p}}{1 + e^{\hat\beta_0 + \hat\beta_1 X_1 + \cdots + \hat\beta_p X_p}}. \]

Template:

xglm <- glm(Y ~ X1 + X2 + ... + Xp,
            data = DATA,
            family = binomial())

new_df <- data.frame(X1 = VAL1, X2 = VAL2, ..., Xp = VALp)
predict(xglm, newdata = new_df, type = "response")  # probabilities

Examples (Melanoma):

  1. Fit a model with sex, age, thickness, and ulcer:
xglm <- glm(dead ~ sex + age + thickness + ulcer,
    data = Melanoma,
    family = binomial())
  2. Male, age 75, thickness 2.9 mm, ulcer present:
new1 <- data.frame(sex = 1, age = 75, thickness = 2.9, ulcer = 1)
predict(xglm, new1, type = "response")
#>         1 
#> 0.4875223
  3. Male, age 75, thickness 2.9 mm, ulcer absent:
new2 <- data.frame(sex = 1, age = 75, thickness = 2.9, ulcer = 0)
predict(xglm, new2, type = "response")
#>         1 
#> 0.2037438
  4. Female, thickness 2.9 mm, ulcer present, comparing ages 55 and 75:
new3 <- tibble(sex = 0, age = c(55, 75), thickness = 2.9, ulcer = 1)
pred3 <- predict(xglm, new3, type = "response")
tibble(age = new3$age, probability = scales::percent(pred3, accuracy = 0.1))
#> # A tibble: 2 × 2
#>     age probability
#>   <dbl> <chr>      
#> 1    55 36.9%      
#> 2    75 38.8%
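
As a hand check of the formula in Section 4.1, the probability from example 2 can be reproduced directly from the coefficients; base R's plogis() computes e^x / (1 + e^x):

# Reproduce example 2 (male, age 75, thickness 2.9 mm, ulcer present) by hand
b <- coef(xglm)
unname(plogis(b["(Intercept)"] + b["sex"] * 1 + b["age"] * 75 +
              b["thickness"] * 2.9 + b["ulcer"] * 1))
# should return the same 0.4875223 as predict() above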

5 Appendix: quick reference (copy‑paste)

5.1 Simple Logistic Regression

# Fit
glm(Y ~ X, data = DATA, family = binomial())
  • DATA → your data frame (e.g., Melanoma)
  • Y → the outcome variable (e.g., dead)
  • X → the predictor variable (e.g., thickness)

5.2 Logistic Regression

# Fit
glm(Y ~ X1 + X2 + ... + Xp, data = DATA, family = binomial())
  • DATA → your data frame (e.g., Melanoma)
  • Y → the outcome variable (e.g., dead)
  • X1, X2, …, Xp → predictor variables (e.g., age, thickness)

5.3 Odds Ratio

m <- glm(Y ~ X1 + X2 + ... + Xp, data = DATA, family = binomial())

# Odds Ratio
exp(coef(m))
  • DATA → your data frame (e.g., Melanoma)
  • Y → the outcome variable (e.g., dead)
  • X1, X2, …, Xp → predictor variables (e.g., age, thickness)

5.4 Predict Probabilities

m <- glm(Y ~ X1 + X2 + ... + Xp, data = DATA, family = binomial())
# Predict probabilities
new_df <- data.frame(X1 = VAL1, X2 = VAL2, ...,  Xp = VALp)
predict(m, newdata = new_df, type = "response")
  • DATA → your data frame (e.g., Melanoma)
  • Y → the outcome variable (e.g., dead)
  • X1, X2, …, Xp → predictor variables (e.g., age, thickness)
  • VAL1, VAL2, …, VALp → predictor values (e.g., 55, 2.8)