Zachary Wyman

Reputation: 333

Dummy Variables on training and testing set resulting in different size dataframe output

I am one-hot encoding my training and testing dataframes with pd.get_dummies(). Both dataframes are rather large, and I noticed that the encoded outputs have different numbers of columns: 271 vs. 290. This happens because some qualitative variables take values that appear in one dataframe but not in the other.

Is there an option I can use with pd.get_dummies to make sure that I get a column of 0's for each value that is present only in the other dataframe?

Upvotes: 1

Views: 693

Answers (2)

anon01

Reputation: 11171

Your safest bet, when possible, is to convert your column to a categorical datatype that includes all possible values before using get_dummies. This is especially useful if your training data changes frequently (streaming/frequently updated) and you want maximal compatibility:

import pandas as pd

x_values = ["a", "b", "c", "d", "e"]
x_cat = pd.CategoricalDtype(categories=x_values)
df = pd.DataFrame(dict(x=["a", "b", "c"], y=[1, 2, 3]))

Dummies that don't know about the possible values "d" and "e":

x_dummies = pd.get_dummies(df.x, dtype=int)

   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1

Dummies that are aware that "d" and "e" exist, even if they are not represented in the present data:

df["x"] = df["x"].astype(x_cat)
x_dummies = pd.get_dummies(df.x, dtype=int)

   a  b  c  d  e
0  1  0  0  0  0
1  0  1  0  0  0
2  0  0  1  0  0
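Applied to the train/test situation in the question, the same idea might look like the sketch below. The frames and the column names "color" and "size" are made up for illustration. The category list is built here from the union of both frames, which is only an assumption; if the full set of possible values is known up front, it can be listed explicitly instead.

```python
import pandas as pd

# Hypothetical toy frames: "green" appears only in test, "blue" only in train.
train = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})
test = pd.DataFrame({"color": ["green", "red"], "size": [4, 5]})

# One shared dtype that knows every value seen in either frame.
all_colors = sorted(set(train["color"]) | set(test["color"]))
color_cat = pd.CategoricalDtype(categories=all_colors)

train["color"] = train["color"].astype(color_cat)
test["color"] = test["color"].astype(color_cat)

# get_dummies emits one column per category, including unobserved ones.
train_dummies = pd.get_dummies(train, dtype=int)
test_dummies = pd.get_dummies(test, dtype=int)
```

Both results should then carry the same columns in the same order, with all-zero columns for values a frame never contains.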

Upvotes: 0

BENY

Reputation: 323226

When you have the full dataframe and would like to convert its object columns to dummy variables, do not split it before calling get_dummies:

df = pd.get_dummies(df)
train = df[cond]  # cond: a boolean mask selecting the training rows
test = df.drop(train.index)

To fix your code (this assumes train and test do not share any index labels):

df = pd.get_dummies(pd.concat([train , test]))
train = df[df.index.isin(train.index)]
test = df.drop(train.index)
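Splitting by index label only works when train and test have no labels in common, and two frames that each carry a default 0..n index do overlap. One way around this, sketched below with made-up toy frames, is to tag each frame with a key during the concat and select by that key afterwards. The key names "train" and "test" are arbitrary choices for this example.

```python
import pandas as pd

# Hypothetical toy frames; both use the default 0..n index, so labels overlap.
train = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})
test = pd.DataFrame({"color": ["green", "red"], "size": [4, 5]})

# keys= adds an outer index level recording which frame each row came from.
combined = pd.concat([train, test], keys=["train", "test"])
encoded = pd.get_dummies(combined, dtype=int)

# Selecting on the outer level recovers each split with its original index.
train_encoded = encoded.loc["train"]
test_encoded = encoded.loc["test"]
```

Because the encoding happens once on the combined frame, both splits end up with identical columns.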

Upvotes: 2
