Reputation: 151
I'm trying to create a get_dummies class for my data, which I want to use in a Pipeline later:
class Dummies(BaseEstimator, TransformerMixin):
    def transform(self, df):
        dummies = pd.get_dummies(df[self.cat], drop_first=True)  # getting dummy cols
        df = pd.concat([df, dummies], axis=1)  # concatenating our dummies
        df = df.drop(self.cat, axis=1)         # dropping our original cat_cols
        return df

    def fit(self, df):
        self.cat = []
        for i in df.columns.tolist():
            if i[0] == 'c':          # my data's categorical cols start with 'c'
                self.cat.append(i)   # storing all my categorical columns for dummies
        return self
Now, when I call fit_transform on X_train and then transform on X_test:
z=Dummies()
X_train=z.fit_transform(X_train)
X_test=z.transform(X_test)
The shapes of X_train and X_test have different numbers of columns:
X_train.shape
X_test.shape
Output:
(10983, 1797) (3661, 1529)
There are more dummies in X_train than in X_test. Clearly, X_test has fewer categories than X_train. How do I write logic in my class so that the categories in X_test are expanded to match the shape of X_train? I want X_test to have the same dummy columns as X_train.
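A minimal reproduction of the mismatch (toy data and the c_job column name are illustrative, not my real columns):

```python
import pandas as pd

# Stand-ins for X_train / X_test: the test set is missing one category
train = pd.DataFrame({'c_job': ['carpenter', 'plumber', 'electrician']})
test = pd.DataFrame({'c_job': ['carpenter', 'plumber']})

# get_dummies only creates columns for categories it actually sees
train_dummies = pd.get_dummies(train, drop_first=True)
test_dummies = pd.get_dummies(test, drop_first=True)

print(train_dummies.shape, test_dummies.shape)  # different column counts
```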
Upvotes: 3
Views: 3929
Reputation: 1261
If we start with two small example dataframes:
train = pd.DataFrame({'job=carpenter': [0, 1, 0],
'job=plumber': [0, 0, 1],
'job=electrician': [1, 0, 0]})
job=carpenter job=plumber job=electrician
0 0 0 1
1 1 0 0
2 0 1 0
test = pd.DataFrame({'job=carpenter': [0, 1, 0],
'job=plumber': [1, 1, 0]})
job=carpenter job=plumber
0 0 1
1 1 1
2 0 0
We can use a dictionary comprehension to collect each column that is in the train set but missing from the test set, mapping each one to 0. Passing that dict to assign then adds those columns to the test set filled with zeroes (which is correct, because no row in the test set contained any of these missing categories to begin with):
train_cols = list(train.columns)
test_cols = list(test.columns)
cols_not_in_test = {c:0 for c in train_cols if c not in test_cols}
test = test.assign(**cols_not_in_test)
This gives us the following test dataframe:
test
job=carpenter job=plumber job=electrician
0 0 1 0
1 1 1 0
2 0 0 0
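The same alignment can also be done in one step with pandas' reindex, which puts the test columns in the train order and fills anything missing with 0:

```python
import pandas as pd

train = pd.DataFrame({'job=carpenter': [0, 1, 0],
                      'job=plumber': [0, 0, 1],
                      'job=electrician': [1, 0, 0]})
test = pd.DataFrame({'job=carpenter': [0, 1, 0],
                     'job=plumber': [1, 1, 0]})

# Give test exactly the train columns, in the same order,
# filling columns absent from test with zeroes
test = test.reindex(columns=train.columns, fill_value=0)
print(list(test.columns))
```

Note that reindex also silently drops any test-only columns, which is usually what you want when the model was fit on the train columns.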
Upvotes: 1
Reputation: 4211
What you want to use here (I think) is scikit-learn's OneHotEncoder:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(categories="auto")
X_train_encoded = encoder.fit_transform(X_train)
X_test_encoded = encoder.transform(X_test)
This keeps the fit_transform syntax and ensures X_test_encoded has the same number of columns as X_train_encoded. It can also be used in a pipeline, as you mentioned, instead of Dummies(). Example:
pipe1=make_pipeline(OneHotEncoder(categories = "auto"), StandardScaler(), PCA(n_components=7), LogisticRegression())
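A small runnable sketch of this (the toy data and the job column name are mine, not from the question); handle_unknown="ignore" additionally guards against categories that appear only in X_test:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame({'job': ['carpenter', 'plumber', 'electrician']})
X_test = pd.DataFrame({'job': ['carpenter', 'plumber']})

# The encoder learns the category set from X_train and reuses it for X_test
encoder = OneHotEncoder(categories="auto", handle_unknown="ignore")
X_train_encoded = encoder.fit_transform(X_train)
X_test_encoded = encoder.transform(X_test)

print(X_train_encoded.shape, X_test_encoded.shape)  # same number of columns
```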
Upvotes: 0