Pandas: Concat DataFrames with Unexpected Behavior

Question

I'm doing some work in Pandas and getting strange behavior when using pd.concat. Specifically, I have a DataFrame, df, and I'm one-hot encoding the zipcode column. Here's what I'm doing:

zip_encoded = label_binarizer.transform(df["zipcode"])        
zip_encoded = pd.DataFrame(zip_encoded, columns=label_binarizer.classes_)
df = df.drop("zipcode", axis=1)
print("df shape:", df.shape)
print("zip encoded shape:", zip_encoded.shape)
result = pd.concat([df, zip_encoded], axis=1)
print("result shape", result.shape)
return label_binarizer, result

This gives the following output:

df shape: (13999, 13)
zip encoded shape: (13999, 10)
result shape (14000, 23)

So I'm just trying to concat along the columns and I expect a result shape of (13999, 23) but instead I see a shape of (14000, 23).

I do the same thing with my test_df by using the LabelBinarizer that I fit on my df. When I do that, I get the even stranger:

df shape: (1000, 13)
zip encoded shape: (1000, 10)
result shape (2000, 23)

When I inspect the new test_df, all of the non-zipcode columns are filled with NaNs...

Any idea what I'm doing incorrectly?

Thanks!

BENY · Accepted Answer

You may need to pass df's index when you create zip_encoded:

zip_encoded = label_binarizer.transform(df["zipcode"])
zip_encoded = pd.DataFrame(zip_encoded, columns=label_binarizer.classes_, index=df.index)

Then do the concat:

df = df.drop("zipcode", axis=1)
result = pd.concat([df, zip_encoded], axis=1)

For example, your df's index may not run from 0 to len(df) - 1. When you create a DataFrame without setting the index, it defaults to 0 through len(df) - 1. pd.concat with axis=1 aligns rows by index label, so mismatched labels are why the shape changes after the concat:

df1 = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({'A': [1, 2]}, index=[1, 2])
print(pd.concat([df1, df2], axis=1))
df2 = pd.DataFrame({'A': [1, 2]}, index=df1.index)
print(pd.concat([df1, df2], axis=1))

Output:

     A    A
0  1.0  NaN
1  2.0  1.0
2  NaN  2.0
   A  A
0  1  1
1  2  2
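
To check whether this is what is happening in your case, you can compare the two indexes before concatenating. Below is a minimal sketch on made-up data: the price column, the z_a/z_b column names, and the NumPy array standing in for the output of label_binarizer.transform are all assumptions for illustration, not taken from the question. It also shows an alternative fix, resetting both indexes so that rows pair up by position.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the question's df: suppose one row was dropped earlier,
# so the index is 1..4 instead of the default 0..3 (hypothetical data).
df = pd.DataFrame({"price": [10, 20, 30, 40]}, index=[1, 2, 3, 4])

# Stand-in for label_binarizer.transform(...): a plain NumPy array has no index,
# so the DataFrame built from it gets the default index 0..3.
encoded = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])
zip_encoded = pd.DataFrame(encoded, columns=["z_a", "z_b"])

# Diagnosis: the indexes differ, so concat aligns on the union of labels (0..4).
same_index = df.index.equals(zip_encoded.index)
misaligned = pd.concat([df, zip_encoded], axis=1)

# Alternative fix: drop both indexes so rows are matched by position.
aligned = pd.concat(
    [df.reset_index(drop=True), zip_encoded.reset_index(drop=True)], axis=1
)

print(same_index)        # False
print(misaligned.shape)  # (5, 3): one extra row, padded with NaN
print(aligned.shape)     # (4, 3)
```

Note that reset_index(drop=True) throws away the original row labels. If you need to keep them, passing index=df.index as above is the better choice.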
