Variable selection using LASSO

Data analysts and data scientists use different regression methods for different kinds of analytics problems, from the simplest to the most complex. One of the most talked-about methods is the Lasso. It is often described as one of the most useful linear regression tools, and we are about to find out why.

LASSO is an abbreviation for “Least absolute shrinkage and selection operator”, which neatly summarizes how Lasso regression works. Lasso does regression analysis with a shrinkage penalty, “where data are shrunk to a certain central point” [1], and performs variable selection by forcing the coefficients of “not-so-significant” variables to become exactly zero through that penalty.
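
To see how a penalty can push coefficients to exactly zero, here is a minimal sketch in base R. It assumes the special case of orthonormal predictors, where the Lasso estimate is simply the soft-thresholded least-squares estimate. The function name `soft_threshold` and the coefficient values are illustrative and are not taken from the breast cancer data.

```r
# Soft-thresholding: the lasso solution when predictors are orthonormal
# (assumption: objective is 0.5 * RSS + lambda * sum(abs(beta)))
soft_threshold <- function(z, lambda) {
  sign(z) * pmax(abs(z) - lambda, 0)
}

# Hypothetical least-squares coefficients
beta_ols <- c(2.5, -0.3, 0.8, -1.7)

# Every coefficient moves towards zero by lambda;
# those smaller than lambda in absolute value become exactly zero
beta_lasso <- soft_threshold(beta_ols, lambda = 1)
print(beta_lasso)
```

The two small coefficients are dropped entirely (selection), while the two large ones are kept but pulled towards zero (shrinkage).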

To understand more about this powerful tool, we will apply it to a real-world problem.

We got our data from Kaggle.com [2]: a set of breast cancer diagnostic cases, which we use throughout this demo. The dataset contains characteristics of the cell nuclei present in the digitized image of a fine needle aspirate (FNA) of a breast mass. The problem we are solving is to identify which physical characteristics of the breast mass tell us, to a significant degree, whether it is benign or malignant.

Prepare data

We divide our data into a training set and a test set.

	library(Matrix)
	library(glmnet)
	library(pROC)
	library(caret)

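	# Import dataset
	data1 = read.csv(file = "./data/input/breast-cancer.csv")
	# Encode the response: malignant ('M') = 1, anything else = 0
	data1$diagnosis<-ifelse(data1$diagnosis=='M', 1,0)
	# Inspect the data as a sparse matrix (the result is not stored or used below)
	data2 = data.matrix(data1)
	Matrix(data2, sparse = TRUE)

	set.seed(6789)

	# Split the data to train and test (70% / 30%)
	split = sample(nrow(data1), floor(0.7*nrow(data1)))
	train = data1[split,]
	test = data1[-split,]

	# Build sparse design matrices from the predictor columns (3 to 32);
	# column 2 holds the diagnosis
	train_sparse = sparse.model.matrix(~., train[,3:32])
	test_sparse = sparse.model.matrix(~., test[,3:32])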
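
As a quick sanity check on the split, the sketch below applies the same logic to row indices only. It assumes the dataset has 569 rows, which the article does not state; under that assumption the test set has 171 rows, the same number of test cases that appears in the accuracy figures further down.

```r
# Assumption: the dataset has 569 rows (not stated in the article)
n <- 569
set.seed(6789)

# Same splitting logic as in the data preparation step, on row indices only
split <- sample(n, floor(0.7 * n))
n_train <- length(split)
n_test <- n - n_train

# No row may appear in both sets
test_rows <- setdiff(seq_len(n), split)
stopifnot(length(intersect(split, test_rows)) == 0)

cat("train:", n_train, "test:", n_test, "\n")
```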


Train the model

After fitting the model on the training set, we use cross-validation to determine the best lambda.

	# Train the model
	glmmod = glmnet(x=train_sparse, y=as.factor(train[,2]), alpha=1, family="binomial")
	plot(glmmod, xvar="lambda")
	glmmod

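	# Coefficients at the last (smallest) lambda on the path;
	# the path can stop early, so do not assume it has exactly 100 values
	coef(glmmod)[, length(glmmod$lambda)]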

	# Try cross validation lasso
	cv.glmmod = cv.glmnet(x=train_sparse, y=as.factor(train[,2]), alpha=1, family="binomial")
	plot(cv.glmmod)

	lambda = cv.glmmod$lambda.1se # the value of lambda used by default
	lambda

	coefs = as.matrix(coef(cv.glmmod)) # convert the sparse coefficients to a dense one-column matrix
	ix = which(abs(coefs[,1]) > 0) # positions of the non-zero coefficients (intercept included)
	length(ix)

	coefs[ix,1, drop=FALSE]

	test$cv.glmmod <- predict(cv.glmmod,newx=test_sparse,type='response')[,1]

	########################

	# Get optimal lambda
	best.lambda <- cv.glmmod$lambda.min
	best.lambda
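
The code above uses both `lambda.1se` and `lambda.min`. The following base R sketch illustrates how these two values relate to the cross-validation curve: `lambda.min` gives the lowest mean error, and `lambda.1se` is the largest lambda whose error is still within one standard error of that minimum. The numbers are hypothetical and the variable names are chosen for illustration only.

```r
# Hypothetical cross-validation summary: one entry per candidate lambda
lambda <- c(0.100, 0.050, 0.020, 0.010, 0.005)
cvm    <- c(0.60, 0.35, 0.22, 0.20, 0.21)   # mean cross-validated error
cvsd   <- c(0.05, 0.04, 0.03, 0.03, 0.03)   # standard error of that mean

# lambda.min: the lambda with the lowest mean CV error
i_min <- which.min(cvm)
lambda_min <- lambda[i_min]

# lambda.1se: the largest lambda whose error is within
# one standard error of the minimum
threshold <- cvm[i_min] + cvsd[i_min]
lambda_1se <- max(lambda[cvm <= threshold])

cat("lambda.min:", lambda_min, "lambda.1se:", lambda_1se, "\n")
```

A larger lambda means a stronger penalty, so the `lambda.1se` model is usually sparser than the `lambda.min` model at a small cost in cross-validated error.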


Predict

We predict the response variable for the test set, then look at the confusion matrix.

	# Predict the test set using the model
	pred_lasso = predict(glmmod, test_sparse, type="response", s=best.lambda)
	pred_lasso

	# Apply a threshold
	new_pred_lasso = ifelse(pred_lasso >= 0.5, 1, 0)
	new_pred_lasso = data.frame(new_pred_lasso)
	data_lasso = cbind(test[,2], new_pred_lasso)
	names(data_lasso) = c("actual", "pred")
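	# caret expects predictions in rows and actual (reference) values in columns
	xtab_lasso = table(data_lasso$pred, data_lasso$actual)

	# By default caret treats the first level ("0", benign) as the positive class
	cm_lasso = confusionMatrix(xtab_lasso)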


Check performance

We compare the actual values of the response variable in the test set with the predicted values.

	# Get performance measures
	overall_accuracy_lasso = cm_lasso$overall['Accuracy']
	f1_lasso = cm_lasso$byClass['F1']


For comparison, we will also solve the same problem using the Ordinary Least Squares (OLS) method and set the results side by side.

Train the model

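	# Train the model (ordinary least squares on the 0/1 response)
	library(lmtest)    # provides coeftest()
	library(sandwich)  # provides vcovHC()
	lmmod = lm(diagnosis ~ . , data = train[,2:32])
	summary(lmmod)

	# Coefficient tests with heteroskedasticity-robust (HC1) standard errors
	coeftest(lmmod, vcov. = vcovHC, type = "HC1")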


Predict

	# Predict the test set using the model
	pred_ols = predict(lmmod, test[,3:32], type="response")
	pred_ols

	# Apply a threshold
	new_pred_ols = ifelse(pred_ols >= 0.5, 1, 0)
	new_pred_ols = data.frame(new_pred_ols)
	data_ols = cbind(test[,2], new_pred_ols)
	names(data_ols) = c("actual", "pred")
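	# caret expects predictions in rows and actual (reference) values in columns
	xtab_ols = table(data_ols$pred, data_ols$actual)

	# By default caret treats the first level ("0", benign) as the positive class
	cm_ols = confusionMatrix(xtab_ols)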


Check performance

	# Get performance measures
	overall_accuracy_ols = cm_ols$overall['Accuracy']
	f1_ols = cm_ols$byClass['F1']


Now, comparing the accuracy of the two methods, Lasso got 166/171 predictions correct, giving 97.08% accuracy, while ordinary least squares got 162/171 correct, giving 94.74%. However, high accuracy is to be expected here given the distribution of benign-to-malignant cases, so let us also look at the F1 score of both models. F1 puts equal importance on False Positives (non-malignant cases classified as malignant) and False Negatives (malignant cases classified as non-malignant), both of which matter in our cancer problem. We want to minimize misclassifications as much as possible, because the classification determines what specific care or health measure is provided to the patient. Looking at F1, Lasso gave us 97.70% while ordinary least squares gave us 95.90%. Again, Lasso outperformed the least-squares method.
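
To make the relationship between accuracy and F1 concrete, here is a small base R sketch. The confusion-matrix counts are hypothetical: they were chosen to sum to 171 with 166 correct predictions, but the article does not report the full confusion matrix, so the split of the five errors is an assumption. The helper `f1_score` is likewise illustrative.

```r
# Hypothetical confusion-matrix counts (1 = malignant, 0 = benign);
# chosen to sum to 171 with 166 correct, not taken from the article
tp <- 60   # malignant predicted as malignant
fn <- 3    # malignant predicted as benign
fp <- 2    # benign predicted as malignant
tn <- 106  # benign predicted as benign

accuracy <- (tp + tn) / (tp + tn + fp + fn)

# F1 is the harmonic mean of precision and recall
f1_score <- function(tp, fp, fn) {
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}

f1_malignant <- f1_score(tp, fp, fn)  # malignant as the positive class
f1_benign    <- f1_score(tn, fn, fp)  # benign as the positive class

print(round(c(accuracy = accuracy,
              f1_malignant = f1_malignant,
              f1_benign = f1_benign), 4))
```

Note that F1 depends on which class is treated as positive. caret's `confusionMatrix` uses the first factor level unless its `positive` argument is set, so it is worth checking which class a reported F1 refers to.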

It might seem that the two have almost the same performance and that we could use either of them for this specific problem. However, if we dig into what the models look like and how they were formulated, we can easily see the significant difference between the two methods.

Examining the OLS model, all the input variables in the dataset are considered in the model. Please refer to the image below for the coefficients.

Now, looking at the Lasso model, we notice that only a few variables are taken into account (only 11 of the 30 independent variables). The rest are ignored, that is, treated by the model as not significant for the outcome of the dependent variable. Yet the accuracy of the model is around 97%, even exceeding that of the model that takes all the independent variables into account! Refer to the image below for the model.

We found that mean texture, mean concave points, mean fractal dimension, standard error in radius, standard error in fractal dimension, worst radius, worst texture, worst smoothness, worst concavity, worst concave points, and worst symmetry, taken together, strongly identify whether the cell nuclei in a breast mass are benign or malignant.

What this tells us is that it is sometimes necessary to let go of variables that make the model unstable, because these noisy or irrelevant variables encourage the model to fit to noise, which is known as overfitting.

Let’s look at the features of LASSO that made it work better than OLS in this specific case. As mentioned at the beginning, one important feature of LASSO is variable selection: Lasso keeps only the significant variables in the model. If we take a closer look at our data, we notice that there are many predictors and that some of the independent variables are related to one another or can be grouped. This already gives us a hint that it might be necessary to remove some of the variables.
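
One simple way to spot related predictors is a correlation matrix. The sketch below uses synthetic measurements rather than the Kaggle data: it assumes radius, perimeter and area as stand-ins for size-related features, plus an unrelated texture variable, and lists the pairs whose absolute correlation exceeds 0.9. The 0.9 cut-off is an arbitrary choice for illustration.

```r
set.seed(1)

# Synthetic nuclei measurements (hypothetical, not the Kaggle data)
radius <- runif(200, min = 7, max = 28)
perimeter <- 2 * pi * radius + rnorm(200, sd = 2)
area <- pi * radius^2 + rnorm(200, sd = 20)
texture <- rnorm(200, mean = 19, sd = 4)   # unrelated to size

features <- data.frame(radius, perimeter, area, texture)

# Pairwise correlations; values near 1 flag near-duplicate predictors
cor_mat <- cor(features)
print(round(cor_mat, 2))

# List each pair once (upper triangle) with absolute correlation above 0.9
high <- which(abs(cor_mat) > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)
pairs <- data.frame(var1 = rownames(cor_mat)[high[, "row"]],
                    var2 = colnames(cor_mat)[high[, "col"]])
print(pairs)
```

When several predictors carry nearly the same information, Lasso tends to keep only some of them, which is consistent with the reduced set of variables seen above.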

Getting predictions is therefore easier as well, since we need to prepare fewer features during inference, unlike OLS, where we have to input all the values from the dataset in order to obtain the response value.
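
A small sketch of why fewer features are needed at inference time: features whose coefficients are exactly zero contribute nothing to the prediction, so they can be left out. The coefficients, feature names and input values below are hypothetical, and the model is assumed to be a logistic one like the Lasso fit in this post.

```r
# Hypothetical fitted coefficients: most are exactly zero after the lasso
intercept <- -1.2
beta <- c(f1 = 0, f2 = 0.8, f3 = 0, f4 = 0, f5 = -1.5, f6 = 0)

# One new observation with all six features measured
x_new <- c(f1 = 3.1, f2 = 0.4, f3 = 7.7, f4 = 2.0, f5 = 0.9, f6 = 5.5)

# Prediction using every feature (inverse logit of the linear predictor)
p_full <- plogis(intercept + sum(beta * x_new))

# Prediction using only the features with non-zero coefficients
selected <- names(beta)[beta != 0]
p_reduced <- plogis(intercept + sum(beta[selected] * x_new[selected]))

cat("selected:", selected, "\n")
cat("same prediction:", isTRUE(all.equal(p_full, p_reduced)), "\n")
```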

Lastly, let us summarise the important characteristics of Lasso in general. Lasso is a supervised algorithm that identifies the variables most strongly associated with the response variable; this is called variable selection. At the same time, Lasso forces the coefficients of the variables towards zero; this is the process of shrinkage, which makes the model less sensitive to a new data set. Because fewer input variables are selected, these processes also help alleviate the limits of human cognition: the resulting model is simpler to interpret.

If you would like to learn more about Lasso regression, I recommend taking a course on Coursera [3] or just reading through this [4].

That’s all for the post. We’d love to hear your thoughts on these articles and anything else data related. SpectData is a boutique Data Science Consultancy with a niche in Artificial Intelligence and Natural Language Processing. This article is written by our Data Scientist, Marriane M.