Dummy Variables on training and testing set resulting in different size dataframe output

Question

I am one-hot encoding my dataframes (training and testing) using pd.get_dummies(). However, both dataframes are rather large, and I noticed that the outputs have different numbers of columns: 271 vs. 290. This is because certain qualitative variables have values that appear in one dataframe but not in the other.

Is there an option I can use with pd.get_dummies to make sure I get a column of 0s for values that are present only in the other dataframe?

BENY · Accepted Answer

When you have the full dataframe and want to convert its object columns to dummy variables, do not split it before calling get_dummies:

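 import pandas as pd
 df = pd.get_dummies(df)      # encode the full dataframe first
 train = df[cond]             # cond: your own boolean mask selecting the training rows
 test = df.drop(train.index)  # everything else becomes the test set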

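To fix your code, concatenate the two dataframes, encode them together, and split them again. This assumes the index labels of train and test do not overlap:

df = pd.get_dummies(pd.concat([train, test]))
train = df[df.index.isin(train.index)]
test = df.drop(train.index)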
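
If the two frames share index labels (for example, both use the default 0..n-1 index), splitting by index as above will not separate them cleanly. Here is a minimal sketch of two ways around that, using made-up toy data; the frame contents and the names train_enc, test_enc, train_only and test_aligned are only for illustration. Option 1 tags each frame with a key in pd.concat and splits on that key. Option 2 encodes the frames separately and uses reindex to give the test frame the training columns, filling missing ones with 0.

```python
import pandas as pd

# Toy frames (hypothetical data). Both use the default 0..n-1 index,
# so their index labels overlap.
train = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})
test = pd.DataFrame({"color": ["green", "red"], "size": [4, 5]})

# Option 1: tag each frame with a key, encode once, then split by key.
combined = pd.get_dummies(pd.concat([train, test], keys=["train", "test"]))
train_enc = combined.loc["train"]
test_enc = combined.loc["test"]

# Option 2: encode separately, then align the test columns to the training
# columns. Columns missing from test are added and filled with 0; columns
# that exist only in test (here color_green) are dropped.
train_only = pd.get_dummies(train)
test_aligned = pd.get_dummies(test).reindex(columns=train_only.columns, fill_value=0)
```

Option 2 is probably closer to what the question asks for, and it keeps categories that appear only in the test set from influencing the training columns.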
