Reputation: 9345
I'm doing some work in Pandas and getting strange behavior when using pd.concat. Specifically, I have a DataFrame, df, and I'm one-hot encoding the zipcode column. Here's what I'm doing:
zip_encoded = label_binarizer.transform(df["zipcode"])
zip_encoded = pd.DataFrame(zip_encoded, columns=label_binarizer.classes_)
df = df.drop("zipcode", axis=1)
print("df shape:", df.shape)
print("zip encoded shape:", zip_encoded.shape)
result = pd.concat([df, zip_encoded], axis=1)
print("result shape", result.shape)
return label_binarizer, result
This gives the following output:
df shape: (13999, 13)
zip encoded shape: (13999, 10)
result shape (14000, 23)
So I'm just trying to concat along the columns, and I expect a result shape of (13999, 23), but instead I see a shape of (14000, 23).
I do the same thing with my test_df, using the LabelBinarizer that I fit on my df. When I do that, I get the even stranger:
df shape: (1000, 13)
zip encoded shape: (1000, 10)
result shape (2000, 23)
When I inspect the new test_df, all of the non-zipcode columns are filled with NaNs...
Any idea what I'm doing incorrectly?
Thanks!
Upvotes: 1
Views: 464
Reputation: 323276
You may need to pass the index from df when you create zip_encoded:
zip_encoded = label_binarizer.transform(df["zipcode"])
zip_encoded = pd.DataFrame(zip_encoded, columns=label_binarizer.classes_,index=df.index)
Then do the concat:
df = df.drop("zipcode", axis=1)
result = pd.concat([df, zip_encoded], axis=1)
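If the original row labels of the frame are not needed, an alternative is to reset its index so both frames use the default 0..n-1 labels. The sketch below is only an illustration: test_df, price, zip_a and zip_b are made-up stand-ins, not the data from the question.

```python
import pandas as pd

# Hypothetical test frame whose row labels start at 100 instead of 0.
test_df = pd.DataFrame({"price": [10, 20, 30]}, index=[100, 101, 102])

# Stand-in for the encoded zipcode frame; it gets the default labels 0..2.
encoded = pd.DataFrame({"zip_a": [1, 0, 1], "zip_b": [0, 1, 0]})

# A quick check before concatenating: do the two indexes match?
print(test_df.index.equals(encoded.index))  # False

# No labels in common, so concat keeps every label and the row count doubles.
doubled = pd.concat([test_df, encoded], axis=1)
print(doubled.shape)  # (6, 3)

# Dropping the old labels makes both frames line up row by row.
fixed = pd.concat([test_df.reset_index(drop=True), encoded], axis=1)
print(fixed.shape)  # (3, 3)
```

Passing index=df.index keeps the original labels on the result, while reset_index(drop=True) discards them, so which one fits depends on whether those labels are used later.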
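For example, your df index may not run from 0 to len(df) - 1. When you create a DataFrame without setting the index, the default is a range from 0 to len(df) - 1. pd.concat with axis=1 aligns rows on their index labels, so any label that appears in only one frame becomes an extra row filled with NaN. That is why the shape is different after the concat: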
df1=pd.DataFrame({'A':[1,2]},index=[0,1])
df2=pd.DataFrame({'A':[1,2]},index=[1,2])
print(pd.concat([df1,df2],axis=1))
df2=pd.DataFrame({'A':[1,2]},index=df1.index)
print(pd.concat([df1,df2],axis=1))
     A    A
0  1.0  NaN
1  2.0  1.0
2  NaN  2.0
   A  A
0  1  1
1  2  2
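The one extra row in the question (13999 becoming 14000) would be consistent with a single label missing from the df index, for instance after a row was dropped. That is a guess about the data, not something the question confirms. A small sketch of that situation, with a NumPy array standing in for the output of label_binarizer.transform and made-up column names:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a gap in its index (label 2 is missing),
# as can happen after filtering or dropping rows.
df = pd.DataFrame({"price": [10, 20, 30, 40]}, index=[0, 1, 3, 4])

# Stand-in for label_binarizer.transform(...): a plain array carries no index.
encoded_values = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])

# Without index=, the new frame gets the default labels 0..3.
misaligned = pd.DataFrame(encoded_values, columns=["zip_a", "zip_b"])
bad = pd.concat([df, misaligned], axis=1)
print(bad.shape)  # (5, 3): union of labels {0, 1, 3, 4} and {0, 1, 2, 3}

# Reusing df.index keeps the rows aligned.
aligned = pd.DataFrame(encoded_values, columns=["zip_a", "zip_b"], index=df.index)
good = pd.concat([df, aligned], axis=1)
print(good.shape)  # (4, 3)
```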
Upvotes: 1