Reputation: 4333
I'm learning how to use Lasso and Ridge with sklearn in Python. The fold assignments are given to me in a column of the data. I want to find the best regularization parameter based on 5-fold cross-validation.
My data looks like the following:
   mpg  cylinders  displacement  horsepower  weight  acceleration  origin  fold
0   18          8           307         130    3504          12.0       1     3
1   15          8           350         165    3693          11.5       1     0
2   18          8           318         150    3436          11.0       1     2
3   16          8           304         150    3433          12.0       1     2
4   17          8           302         140    3449          10.5       1     3
reg_para = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]
mpg is the y/target variable and the other columns are the predictors. The last column contains the folds, and reg_para above lists the candidate regularization values. I want to run a Lasso and a Ridge and find the best parameter for each. The problem I am having is incorporating the specified folds into the cross-validation. Here is what I have so far (for Lasso):
from sklearn.linear_model import Lasso, LassoCV
lasso_model = LassoCV(cv=5, alphas=reg_para)
lasso_fit = lasso_model.fit(X,y)
Is there a simple way to incorporate the fold splits? Any help is greatly appreciated.
Reputation: 14377
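If your data are in a pandas DataFrame, all you need to do is pull out the fold column and hand it to a cross-validation splitter that holds out one label at a time:
fold_labels = df["fold"]
# Older scikit-learn releases: sklearn.cross_validation.LeaveOneLabelOut(fold_labels).
# That module was later removed; LeaveOneGroupOut in sklearn.model_selection is its replacement.
from sklearn.linear_model import LassoCV
from sklearn.model_selection import LeaveOneGroupOut
cv = list(LeaveOneGroupOut().split(X, y, groups=fold_labels))
lasso_model = LassoCV(cv=cv, alphas=reg_para)
So once you have the fold labels in an array fold_labels, you can use the leave-one-label-out splitter (LeaveOneLabelOut, now LeaveOneGroupOut), which treats each distinct fold value as one held-out test set.
(Sorry that the code is only a sketch; X and y are assumed to be defined as in the question. It should be sufficient to elucidate the idea, though.)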
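For a version that runs end to end, here is a minimal sketch of the same idea covering both Lasso and Ridge. It swaps in PredefinedSplit from sklearn.model_selection instead of the leave-one-label-out splitter; both hold out one fold value at a time, so the resulting splits should match. The DataFrame below is synthetic stand-in data (the values and the reduced column set are assumptions, not the real dataset), and searching Ridge with GridSearchCV scored by negative mean squared error is one reasonable choice, not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV, Ridge
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Synthetic stand-in for the question's data (hypothetical values and columns).
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "displacement": rng.uniform(70, 450, n),
    "horsepower": rng.uniform(50, 230, n),
    "weight": rng.uniform(1600, 5100, n),
    "fold": np.arange(n) % 5,  # pre-assigned fold labels 0..4
})
df["mpg"] = (45 - 0.006 * df["weight"] - 0.03 * df["horsepower"]
             + rng.normal(0, 1, n))

reg_para = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]

# Keep the fold column out of the predictors.
X = df.drop(columns=["mpg", "fold"]).to_numpy()
y = df["mpg"].to_numpy()

# Each distinct value in test_fold becomes one held-out test set.
cv = PredefinedSplit(test_fold=df["fold"].to_numpy())

# Lasso: LassoCV accepts a CV splitter object directly.
lasso_model = LassoCV(cv=cv, alphas=reg_para, max_iter=100_000).fit(X, y)
best_lasso_alpha = lasso_model.alpha_

# Ridge: grid search over the same alphas with the same predefined folds.
ridge_search = GridSearchCV(
    Ridge(), {"alpha": reg_para}, cv=cv, scoring="neg_mean_squared_error"
).fit(X, y)
best_ridge_alpha = ridge_search.best_params_["alpha"]

print("best Lasso alpha:", best_lasso_alpha)
print("best Ridge alpha:", best_ridge_alpha)
```

With the real data, replace the synthetic DataFrame with your own and keep the rest unchanged. Since the penalty depends on feature scale, standardizing the predictors first (for example with StandardScaler in a Pipeline) may be worth considering.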