Quick analysis of the canonical Leukemia dataset for automated variable selection. Code here is based on this example. They describe the data as follows:
We use the preprocessed data of Dettling (2004) retrieved from the supplementary materials accompanying Friedman et al (2010). The data are represented as a 72 x 3,571 matrix of gene expression values (variable X), and a vector of 72 binary disease outcomes (variable y).
library(tidyverse)
library(glmnet)
knitr::opts_chunk$set(echo = TRUE,
warning = FALSE,
message = FALSE,
fig.width = 8,
fig.height = 4
)
theme_set(theme_bw() + theme(legend.position = "bottom"))
set.seed(1)
We import the data from Leukemia.RData
and construct \(y\) and \(x\).
load("Leukemia.RData")
x = Leukemia$x
y = Leukemia$y
lambda = 10^(seq(0,-2,-0.05))
We fit the lasso model for each tuning parameter in a pre-defined grid, and then compare the fits using cross validation.
fit.glmnet =
glmnet(x, y, family = "binomial", lambda = lambda)
out.cv.glmnet =
cv.glmnet(x, y, family = "binomial", type.measure = "class", lambda = lambda)
lambda_opt = out.cv.glmnet$lambda.1se
The plot below shows coefficient estimates corresponding to a subset of the genes in the dataset – these are genes that have non-zero coefficients for at least one tuning parameter value in the pre-defined grid. As lambda increases, the coefficient values are shrunk to zero and the model becomes more sparse. The optimal tuning parameter, determined using cross validation, is shown by a vertical blue line.
broom::tidy(fit.glmnet) %>%
select(term, lambda, estimate) %>%
complete(term, lambda, fill = list(estimate = 0) ) %>%
filter(term != "(Intercept)") %>%
ggplot(aes(x = log(lambda, 10), y = estimate, group = term, color = term)) +
geom_path() +
geom_vline(xintercept = log(lambda_opt, 10), color = "blue", size = 1.2) +
theme(legend.position = "none")