Reputation: 21
I'm stuck trying to fix an issue. Here is what I'm trying to do:
I'd like to predict the missing (NaN) values of a categorical feature using logistic regression. Here is my code:
df_1: my dataset, with missing values only in the "Metier" feature (these are the values I'm trying to predict)
X_train = pd.get_dummies(df_1[df_1['Metier'].notnull()].drop(columns='Metier'), drop_first=True)
X_test = pd.get_dummies(df_1[df_1['Metier'].isnull()].drop(columns='Metier'), drop_first=True, dummy_na=True)
Y_train = df_1[df_1['Metier'].notnull()]['Metier']
Y_test = df_1[df_1['Metier'].isnull()]['Metier']
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, Y_train)
classifier.score(X_train, Y_train)  # 0.705112088833019
BUT when I try to get predictions on the test rows, it says:
ValueError: X has 42 features per sample; expecting 1423
I would highly appreciate it if someone could give me a hand.
Thanks a lot :)
Upvotes: 2
Views: 2453
Reputation: 29732
A rule of thumb is to never call pandas.get_dummies
separately on multiple dataframes: it does not guarantee that the results have the same columns, and hence the same dimensions.
import pandas as pd
print(pd.get_dummies(['a', 'b', 'c']))
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
print(pd.get_dummies(['b', 'c']))
   b  c
0  1  0
1  0  1
It is only safe if you call pandas.get_dummies
on the full dataframe first and then split the result into x_train
and x_test
. Better still, you can use sklearn.preprocessing.OneHotEncoder
:
import numpy as np
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse_output=False)  # sparse=False in scikit-learn < 1.2
ohe.fit_transform(np.reshape(['a', 'b', 'c'], (-1, 1)))
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
ohe.transform(np.reshape(['b', 'c'], (-1, 1)))  # transform, NOT fit_transform: reuses the categories learned above
array([[0., 1., 0.],
[0., 0., 1.]])
Notice that the two different inputs now produce the same number of columns.
Upvotes: 1