Reputation: 333
I am one-hot encoding my dataframes (training & testing) using pd.get_dummies()
. However, both dataframes are rather large, and I noticed that the outputs have different column counts: 271 vs. 290. This is because certain qualitative variables have values that appear in one dataframe but not in the other.
Is there a command I can use with pd.get_dummies
to make sure that I get a column of all 0's when a variable's value appears only in the other dataframe?
Upvotes: 1
Views: 693
Reputation: 11171
Your safest bet, when possible, is to convert your column to a categorical dtype that lists all possible values before calling get_dummies
. This is especially useful if your training data changes frequently (streaming/frequently updated) and you want maximal compatibility:
x_values = ["a", "b", "c", "d", "e"]
x_cat = pd.CategoricalDtype(categories=x_values)
df = pd.DataFrame(dict(x=["a", "b", "c"], y=[1,2,3]))
Dummies that don't know about the possible values "d" and "e":
x_dummies = pd.get_dummies(df.x)
a b c
0 1 0 0
1 0 1 0
2 0 0 1
Dummies that are aware "d" and "e" exist, even though they are not represented in the present data:
df["x"] = df["x"].astype(x_cat)
x_dummies = pd.get_dummies(df.x)
a b c d e
0 1 0 0 0 0
1 0 1 0 0 0
2 0 0 1 0 0
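Put together as a runnable sketch (the train/test contents here are hypothetical; the point is that both frames share one CategoricalDtype, so get_dummies emits the same columns for each):

```python
import pandas as pd

# Hypothetical split: "d" only ever appears in the test frame.
train = pd.DataFrame({"x": ["a", "b", "c"], "y": [1, 2, 3]})
test = pd.DataFrame({"x": ["a", "d"], "y": [4, 5]})

# One shared dtype listing every category either frame may contain.
x_cat = pd.CategoricalDtype(categories=["a", "b", "c", "d", "e"])
train["x"] = train["x"].astype(x_cat)
test["x"] = test["x"].astype(x_cat)

train_dummies = pd.get_dummies(train["x"])
test_dummies = pd.get_dummies(test["x"])

# Both frames now carry the same five dummy columns; categories absent
# from a frame (like "d" in train) become all-zero columns.
print(list(train_dummies.columns))  # ['a', 'b', 'c', 'd', 'e']
print(list(test_dummies.columns))   # ['a', 'b', 'c', 'd', 'e']
```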
Upvotes: 0
Reputation: 323226
When you have the dataframe and would like to transform object columns into dummy variables, do not split it before calling get_dummies:
df = pd.get_dummies(df)
train = df[cond]
test = df.drop(train.index)
To fix your code:
df = pd.get_dummies(pd.concat([train , test]))
train = df[df.index.isin(train.index)]
test = df.drop(train.index)
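Applied end to end, the concat-then-split fix might look like this (the train/test contents are hypothetical; encoding the concatenated frame guarantees both splits get identical columns):

```python
import pandas as pd

# Hypothetical split frames; "d" appears only in the test rows.
train = pd.DataFrame({"x": ["a", "b", "c"], "y": [1, 2, 3]}, index=[0, 1, 2])
test = pd.DataFrame({"x": ["a", "d"], "y": [4, 5]}, index=[3, 4])

# One-hot encode the combined frame, then split back on the original indexes.
df = pd.get_dummies(pd.concat([train, test]))
train_enc = df[df.index.isin(train.index)]
test_enc = df.drop(train.index)

# Both splits now share the same columns, including an all-zero x_d in train.
print(list(train_enc.columns))
```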
Upvotes: 2