Reputation: 2232
I have a question about a penalized regression model with Lasso and how to interpret the values it returns. I have text content and want to find the most predictive words for each class.
Code and Data
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Lasso
# Import test data
data = pd.read_csv('https://pastebin.com/raw/rXr4kd8S')
# Make ngrams
vectorizer = CountVectorizer(min_df=0.00, max_df=1.0, max_features=1000, stop_words='english', binary=True, lowercase=True, ngram_range=(1, 1))
grams = vectorizer.fit_transform(data['text'])
# Show features (words)
vectorizer.get_feature_names()
# Show Lasso coefficients
def lassoRegression(para1, para2):
    lasso = Lasso(alpha = 0, fit_intercept=True, normalize=True, max_iter=1000)
    lasso.fit(para1, para2)
    return lasso.coef_
model_lasso = lassoRegression(grams, data['label'])
# Sort coefficients
lasso_coef = pd.DataFrame(np.round(model_lasso, decimals=2), vectorizer.get_feature_names(), columns = ["penalized_regression_coefficients"])
lasso_coef = lasso_coef[lasso_coef['penalized_regression_coefficients'] != 0]
lasso_coef = lasso_coef.sort_values(by = 'penalized_regression_coefficients', ascending = False)
lasso_coef
# Top/Low 10 values
lasso_coef = pd.concat([lasso_coef.head(10),lasso_coef.tail(10)], axis=0)
# Plot
ax = sns.barplot(x = 'penalized_regression_coefficients', y= lasso_coef.index , data=lasso_coef)
ax.set(xlabel='Penalized Regression Coeff.')
plt.show()
Changing alpha causes the following problems:
Out: For Lasso(alpha = 0, ...)
ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
penalized_regression_coefficients
data 0.62
awesome 0.33
content 0.31
performs 0.05
enter 0.02
great -0.01
Out: For Lasso(alpha = 0.001, ...)
penalized_regression_coefficients
great -0.93
Out: For Lasso(alpha = 1, ...)
penalized_regression_coefficients
empty
Questions:
alpha = 0 returns a warning (but also values), and any other alpha setting returns almost nothing. Considering the input data, even after stopword removal, I would have expected more words with corresponding positive and negative weights. Is something wrong here? Note that the input data has intentionally repetitive elements, as I hoped to test the reliability of the model that way.
Upvotes: 2
Views: 5116
Reputation: 2298
alpha is the penalty strength (Lasso is the pure-L1 case of the elastic net); depending on the source it is called either lambda or alpha. alpha=0 is equivalent to ordinary least squares. Lasso regression penalizes the coefficients and forces them toward 0. The smaller a coefficient, the less important the variable is, or the less variance it explains. The actual value matters less here, since in a logistic regression the coefficient ends up inside an exponential. So your last assumption is pretty much correct: if the coefficient is positive, then that variable indicates a higher probability of label 1 with each occurrence of that word.
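To make the shrinkage effect concrete, here is a minimal sketch on synthetic data (the data, the helper name count_nonzero_coefs, and the alpha grid are assumptions for illustration, not the question's dataset). It counts how many coefficients survive as alpha grows:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (an assumption for illustration, not the question's dataset):
# 200 samples, 20 binary features, only the first 3 actually drive the target.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 20)).astype(float)
true_coef = np.zeros(20)
true_coef[:3] = [1.0, -0.8, 0.5]
y = X @ true_coef + rng.normal(scale=0.1, size=200)

def count_nonzero_coefs(alpha):
    """Fit a Lasso with the given alpha and count the coefficients that survive."""
    model = Lasso(alpha=alpha, max_iter=10_000)
    model.fit(X, y)
    return int(np.sum(model.coef_ != 0))

nonzero_by_alpha = {a: count_nonzero_coefs(a) for a in (0.001, 0.01, 0.1, 1.0)}
for alpha, n in nonzero_by_alpha.items():
    print(f"alpha={alpha}: {n} non-zero coefficients")
```

On data like this, a large alpha should zero out every coefficient, which matches the empty result the question reports for alpha = 1.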
As for why your Lasso regression will not converge, you can read here.
I suggest reading up on the methods more before using them. This course talks a lot about statistics and explains why and when to use Lasso regression. If you are familiar with OLS, then you can understand the interpretation of the coefficients: with all other variables held constant, for each 1-unit increase in the variable data you can expect the response variable Y to increase by 0.62 on average. But as I said previously, this will lead to a percentage change when used in the logistic equation.
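If the label is binary and you switch to an L1-penalized logistic regression (which the question's script does not do; this is an assumption for illustration), the exponential mentioned above can be read off directly: exp(coefficient) is the multiplicative change in the odds of label 1. A rough sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary bag-of-words matrix (an assumption, not the question's data):
# column 0 pushes toward label 1, column 1 toward label 0, the rest are noise.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 10)).astype(float)
logits = 2.0 * X[:, 0] - 2.0 * X[:, 1]
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# penalty="l1" gives a Lasso-style penalty; in scikit-learn, C is the inverse
# of the regularization strength, so a smaller C means stronger shrinkage.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(X, y)

coefs = clf.coef_[0]
odds_ratios = np.exp(coefs)  # multiplicative change in the odds of label 1
for i, (c, o) in enumerate(zip(coefs, odds_ratios)):
    print(f"feature {i}: coef={c:+.2f}, odds ratio={o:.2f}")
```

A positive coefficient gives an odds ratio above 1 (the word raises the odds of label 1), and a negative one gives an odds ratio below 1.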
Please see Cross Validation for more help on statistics.
Upvotes: 1
Reputation: 491
Okay, so a few things here.
I see that you mention logistic regression, which is not used in your script. You might want to think about whether linear or logistic regression is the right choice.
The code is trying to tell you that close to alpha=0 the Lasso regression results are not reliable. Why is this the case? Well, if you go to the code for the Lasso, you'll eventually reach https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/cd_fast.pyx line 516, where there is a float comparison going on.
What does it mean when your alpha goes slowly towards 0? Well it means that your regression is similar to an OLS regression. Now if your coefficients are quickly disappearing, it implies that your coefficients are very weak in explaining the results.
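The claim that a shrinking alpha approaches OLS can be checked with a small sketch. The data below is synthetic and well-conditioned (many more rows than columns), which is an assumption and unlike the question's text data; the variable names are illustrative only:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic, well-conditioned data (an assumption): many more rows than columns.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.7, 0.0]) + rng.normal(scale=0.1, size=300)

ols = LinearRegression().fit(X, y)

# Largest absolute gap between the Lasso and OLS coefficients for each alpha.
gaps = {}
for alpha in (1.0, 0.1, 0.0001):
    lasso = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    gaps[alpha] = float(np.max(np.abs(lasso.coef_ - ols.coef_)))
    print(f"alpha={alpha}: max |lasso - ols| = {gaps[alpha]:.4f}")
```

Using a small positive alpha rather than exactly 0 avoids the convergence warning; for a true alpha=0 fit, LinearRegression is the more direct tool.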
Your TODO list:
1. Try both OLS and logistic regression to see which one is more appropriate.
2. Look at the t-statistics and see if any result is significant.
3. If nothing is significant, then maybe look at how you set up the regression; there might be a bug in the code.
4. If any of the concepts are unclear, go to the course mentioned by @lwileczek.
Upvotes: 1