Reputation: 21
I have been trying to implement logistic regression for a classification problem, but it is giving me really bizarre results. I got decent results with gradient boosting and random forests, so I thought I would go back to basics and see how well a simple model can do. Can you help me point out what I am doing wrong that is causing this overfitting? You can get the data from https://www.kaggle.com/c/santander-customer-satisfaction/data
Here is my code:
import pandas as pd
import numpy as np
train = pd.read_csv("path")
test = pd.read_csv("path")
test["TARGET"] = 0
fullData = pd.concat([train,test], ignore_index = True)
remove1 = []
for col in fullData.columns:
    if fullData[col].std() == 0:
        remove1.append(col)
fullData.drop(remove1, axis=1, inplace=True)
import numpy as np
remove = []
cols = fullData.columns
for i in range(len(cols)-1):
    v = fullData[cols[i]].values
    for j in range(i+1,len(cols)):
        if np.array_equal(v,fullData[cols[j]].values):
            remove.append(cols[j])
fullData.drop(remove, axis=1, inplace=True)
from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(fullData, test_size=0.20, random_state=1729)
print(X_train.shape, X_test.shape)
y_train = X_train["TARGET"].values
X = X_train.drop(["TARGET","ID"],axis=1,inplace = False)
from sklearn.ensemble import ExtraTreesClassifier
clf = ExtraTreesClassifier(random_state=1729)
selector = clf.fit(X, y_train)
from sklearn.feature_selection import SelectFromModel
fs = SelectFromModel(selector, prefit=True)
X_t = X_test.drop(["TARGET","ID"],axis=1,inplace = False)
X_t = fs.transform(X_t)
X_tr = X_train.drop(["TARGET","ID"],axis=1,inplace = False)
X_tr = fs.transform(X_tr)
from sklearn.linear_model import LogisticRegression
log = LogisticRegression(penalty='l2', C=1, random_state=1)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(log,X_tr,y_train,cv = 10)
print(scores.mean())
log.fit(X_tr,y_train)
predictions = log.predict(X_t)
predictions = predictions.astype(int)
print(predictions.mean())
Upvotes: 1
Views: 1162
Reputation: 1725
You are not really tuning the C parameter: you set it explicitly, but only to its default value of 1. C is the inverse of the regularization strength, and it is one of the usual suspects for overfitting. Have a look at GridSearchCV and try a range of values for C (e.g. from 10^-5 to 10^5) to see if that eases your problem. Changing the penalty to 'l1' might help as well.
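A minimal sketch of that search, under a few assumptions of mine: the data is a synthetic stand-in built with make_classification rather than the competition files, the names X_demo and y_demo are made up for the example, and scoring="roc_auc" and cv=5 are my choices, not something taken from the question. The liblinear solver is used because it supports both 'l1' and 'l2'.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the real data (assumption for the demo).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

# liblinear handles both the 'l1' and the 'l2' penalty.
log = LogisticRegression(solver="liblinear", random_state=1)

# C from 10^-5 to 10^5 in powers of ten, crossed with both penalties.
param_grid = {
    "C": np.logspace(-5, 5, 11),
    "penalty": ["l1", "l2"],
}

# scoring and cv are illustrative choices, not prescribed by the question.
search = GridSearchCV(log, param_grid, scoring="roc_auc", cv=5)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, best_score)
```

On the real data you would pass X_tr and y_train from the question instead of the synthetic arrays, and then inspect search.cv_results_ to see how the score moves as C changes.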
Besides, that competition posed several challenges: the data set is imbalanced, and the distributions of the training set and the private leaderboard (LB) set were a bit different. All of this is going to work against you, especially when using simple algorithms like LR.
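To illustrate why the imbalance matters, here is a rough sketch on synthetic data (not the competition files). The 96/4 class split, the variable names and the use of class_weight="balanced" are assumptions of mine for the demo. It compares a classifier that always predicts the majority class with a class-weighted logistic regression, using both accuracy and ROC AUC.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with a rare positive class (the 96/4 split is arbitrary).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    n_clusters_per_class=1, weights=[0.96, 0.04], random_state=0,
)

# Baseline that always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent")

# class_weight="balanced" reweights samples inversely to class frequency.
model = LogisticRegression(
    C=1.0, class_weight="balanced", max_iter=1000, random_state=1
)

baseline_acc = cross_val_score(
    baseline, X_demo, y_demo, cv=5, scoring="accuracy").mean()
baseline_auc = cross_val_score(
    baseline, X_demo, y_demo, cv=5, scoring="roc_auc").mean()
model_auc = cross_val_score(
    model, X_demo, y_demo, cv=5, scoring="roc_auc").mean()

print("baseline accuracy:", baseline_acc)
print("baseline ROC AUC: ", baseline_auc)
print("model ROC AUC:    ", model_auc)
```

The baseline scores high on accuracy while learning nothing, which is why the default accuracy score from cross_val_score can look good on this kind of data even when the model is of little use. A ranking metric such as ROC AUC is usually a more informative check.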
Upvotes: 1