keshav N

Reputation: 151

Shape mismatch when one-hot encoding train and test data: train data has more dummy columns than test data when using get_dummies with a pipeline

I'm trying to create a get_dummies transformer class for my data, which I want to use in a Pipeline later:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class Dummies(BaseEstimator, TransformerMixin):
    def transform(self, df):
        dummies = pd.get_dummies(df[self.cat], drop_first=True)  ## getting dummy cols
        df = pd.concat([df, dummies], axis=1)                     ## concatenating our dummies
        df.drop(self.cat, axis=1, inplace=True)                   ## dropping our original cat_cols
        return df

    def fit(self, df, y=None):
        self.cat = []
        for i in df.columns.tolist():
            if i[0] == 'c':          ## my categorical cols all start with 'c'
                self.cat.append(i)   ## storing all my categorical columns for dummies
        return self

Now, when I call fit_transform on X_train and then transform on X_test:

z=Dummies()
X_train=z.fit_transform(X_train)
X_test=z.transform(X_test)

The shapes of X_train and X_test have different numbers of columns:

X_train.shape
X_test.shape

Output:

(10983, 1797) (3661, 1529)

There are more dummy columns in X_train than in X_test. Clearly, X_test contains fewer categories than X_train. How do I write logic in my class so that the dummy columns of X_test line up with those of X_train? I want X_test to end up with the same dummy variables as X_train.

Upvotes: 3

Views: 3929

Answers (3)

James Dellinger

Reputation: 1261

If we start with two small example dataframes:

train = pd.DataFrame({'job=carpenter': [0, 1, 0],
                   'job=plumber': [0, 0, 1],
                   'job=electrician': [1, 0, 0]})

    job=carpenter   job=plumber  job=electrician
0               0             0                1
1               1             0                0
2               0             1                0


test = pd.DataFrame({'job=carpenter': [0, 1, 0],
                   'job=plumber': [1, 1, 0]})

    job=carpenter   job=plumber
0               0             1
1               1             1
2               0             0

We can use a dictionary comprehension to collect each column that appears in the train set but is missing from the test set, mapping it to a value of 0. Passing that dictionary to assign adds each missing column to the test set and fills it with zeroes, which is correct because no row in the test set contained any of those categories to begin with:

train_cols = list(train.columns)
test_cols = list(test.columns)
cols_not_in_test = {c:0 for c in train_cols if c not in test_cols}
test = test.assign(**cols_not_in_test)

This gives us the following test dataframe:

test

   job=carpenter   job=plumber  job=electrician
0              0             1                0
1              1             1                0
2              0             0                0
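
The same idea can be folded into your Dummies class: a minimal sketch, assuming the dummy columns seen during fit are remembered in an attribute (here called dummy_cols, an illustrative name), so transform can fill in whatever the test set is missing:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class Dummies(BaseEstimator, TransformerMixin):
    def fit(self, df, y=None):
        # remember the categorical columns and the dummy columns seen on the training data
        self.cat = [c for c in df.columns if c[0] == 'c']
        self.dummy_cols = pd.get_dummies(df[self.cat], drop_first=True).columns
        return self

    def transform(self, df):
        dummies = pd.get_dummies(df[self.cat], drop_first=True)
        # add any training-time dummy column that is missing here, filled with zeroes
        missing = {c: 0 for c in self.dummy_cols if c not in dummies.columns}
        dummies = dummies.assign(**missing)[self.dummy_cols]  # also fixes column order
        return pd.concat([df.drop(self.cat, axis=1), dummies], axis=1)

Selecting with [self.dummy_cols] at the end also drops any category that appears only in the test set, so both frames end up with identical dummy columns.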

Upvotes: 1

MaximeKan

Reputation: 4211

What you want to use here (I think) is scikit-learn's OneHotEncoder:

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(categories="auto")
X_train_encoded = encoder.fit_transform(X_train)
X_test_encoded = encoder.transform(X_test)

This keeps the fit/transform syntax and ensures X_test_encoded has the same number of columns as X_train_encoded. It can also be used in a pipeline, as you mentioned, instead of Dummies(). Example:

pipe1=make_pipeline(OneHotEncoder(categories = "auto"), StandardScaler(), PCA(n_components=7), LogisticRegression())
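
For reference, a minimal runnable sketch of that pipeline with its imports, assuming X_train, y_train, and X_test already exist; the dense-output setting is an assumption added here so StandardScaler and PCA receive a regular array rather than a sparse matrix:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# dense output so StandardScaler and PCA can consume it
# (`sparse=False` applies to older scikit-learn releases; newer ones use `sparse_output=False`)
pipe1 = make_pipeline(
    OneHotEncoder(categories="auto", sparse=False),
    StandardScaler(),
    PCA(n_components=7),
    LogisticRegression(),
)

pipe1.fit(X_train, y_train)
predictions = pipe1.predict(X_test)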

Upvotes: 0

Razaay

Reputation: 46

You could append both dataframes, run get_dummies() on the combined frame, and then split it back apart, as sketched below.
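
A minimal sketch of that idea, assuming X_train and X_test are the raw, un-encoded dataframes (variable names are illustrative):

import pandas as pd

combined = pd.concat([X_train, X_test], axis=0)        # stack train on top of test
combined = pd.get_dummies(combined, drop_first=True)   # dummies now see every category once
X_train_enc = combined.iloc[:len(X_train)]             # first block is the train rows
X_test_enc = combined.iloc[len(X_train):]              # remaining rows are the test set

Because both frames are encoded together, they are guaranteed to end up with identical dummy columns.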

Upvotes: 1
