Overview

Quick analysis of the canonical Leukemia dataset for automated variable selection. Code here is based on this example. They describe the data as follows:

We use the preprocessed data of Dettling (2004) retrieved from the supplementary materials accompanying Friedman et al (2010). The data are represented as a 72 x 3,571 matrix of gene expression values (variable X), and a vector of 72 binary disease outcomes (variable y).

library(tidyverse)
library(glmnet)

knitr::opts_chunk$set(echo = TRUE,
  warning = FALSE,
  message = FALSE,
  fig.width = 8,
  fig.height = 4
)

theme_set(theme_bw() + theme(legend.position = "bottom"))

set.seed(1)

Data cleaning

We import the data from Leukemia.RData and construct \(y\) and \(x\).

load("Leukemia.RData")

x = Leukemia$x
y = Leukemia$y

lambda = 10^(seq(0,-2,-0.05))

Data analysis

We fit the lasso model for each tuning parameter in a pre-defined grid, and then compare the fits using cross validation.

fit.glmnet =
  glmnet(x, y, family = "binomial", lambda = lambda)

out.cv.glmnet =
  cv.glmnet(x, y, family = "binomial", type.measure = "class", lambda = lambda)

lambda_opt = out.cv.glmnet$lambda.1se

The plot below shows coefficient estimates corresponding to a subset of the genes in the dataset – these are genes that have non-zero coefficients for at least one tuning parameter value in the pre-defined grid. As lambda increases, the coefficient values are shrunk to zero and the model becomes more sparse. The optimal tuning parameter, determined using cross validation, is shown by a vertical blue line.

broom::tidy(fit.glmnet) %>% 
  select(term, lambda, estimate) %>% 
  complete(term, lambda, fill = list(estimate = 0) ) %>% 
  filter(term != "(Intercept)") %>% 
  ggplot(aes(x = log(lambda, 10), y = estimate, group = term, color = term)) + 
  geom_path() + 
  geom_vline(xintercept = log(lambda_opt, 10), color = "blue", size = 1.2) +
  theme(legend.position = "none")