Reputation: 141
I want to encode a dataframe that has multiple columns of the same "type", for example:
import pandas as pd
df = pd.DataFrame(data=[["France", "Bupapest", "Sweden", "Paris"], ["Italy", "Frankfurt", "France", "Naples"]], columns=["Countries 1", "Cities 1", "Countries 2", "Cities 2"])
print(df)
Output:
  Countries 1   Cities 1 Countries 2 Cities 2
0      France   Bupapest      Sweden    Paris
1       Italy  Frankfurt      France   Naples
How do I one-hot encode this dataframe by passing in groups of column indices that should be treated as a single variable? In this example I would pass in [0, 2] and [1, 3], because the Countries 1 and Countries 2 columns together contain 3 different countries and should therefore share 3 categories, not 2 each. The same principle applies to the two cities columns.
Upvotes: 3
Views: 979
Reputation: 323226
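I would use wide_to_long to flatten the df, then factorize each flattened column so that the original columns share one set of integer codes, and finally unstack back to the wide layout: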
s=pd.wide_to_long(df.reset_index(),stubnames=['Countries','Cities'],i='index',j='unstack',sep=' ').apply(lambda x : pd.factorize(x)[0]+1).unstack()
s.columns=s.columns.map('{0[0]} {0[1]}'.format)
s=s.reindex(columns=df.columns)
s
Out[1377]:
       Countries 1  Cities 1  Countries 2  Cities 2
index
0                1         1            3         3
1                2         2            1         4
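The one-liner above packs several steps together, so here is a sketch of the same chain split into named intermediate steps. The variable names (long_df, codes, wide) are only for illustration and are not part of the original answer; the calls themselves are the same ones used above.

```python
import pandas as pd

df = pd.DataFrame(
    data=[["France", "Bupapest", "Sweden", "Paris"],
          ["Italy", "Frankfurt", "France", "Naples"]],
    columns=["Countries 1", "Cities 1", "Countries 2", "Cities 2"],
)

# Step 1: flatten. "Countries 1"/"Countries 2" are stacked into one
# "Countries" column (same for "Cities"), indexed by (index, unstack).
long_df = pd.wide_to_long(
    df.reset_index(),
    stubnames=["Countries", "Cities"],
    i="index",
    j="unstack",
    sep=" ",
)

# Step 2: factorize each stacked column, so a value gets the same
# integer code no matter which original column it came from.
codes = long_df.apply(lambda col: pd.factorize(col)[0] + 1)

# Step 3: go back to the wide layout and restore the original names/order.
wide = codes.unstack()
wide.columns = [f"{stub} {num}" for stub, num in wide.columns]
wide = wide.reindex(columns=df.columns)

print(wide)
```

Note that this produces shared integer labels rather than one-hot columns; "France" is coded 1 in both Countries 1 and Countries 2.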
Or use get_dummies on the flattened frame, which gives one-hot columns shared across the original columns:
s=pd.get_dummies(pd.wide_to_long(df.reset_index(),stubnames=['Countries','Cities'],i='index',j='unstack',sep=' '))
s
Out[1392]:
               Countries_France  Countries_Italy  Countries_Sweden  \
index unstack
0     1                       1                0                 0
1     1                       0                1                 0
0     2                       0                0                 1
1     2                       1                0                 0

               Cities_Bupapest  Cities_Frankfurt  Cities_Naples  Cities_Paris
index unstack
0     1                      1                 0              0             0
1     1                      0                 1              0             0
0     2                      0                 0              0             1
1     2                      0                 0              1             0
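If you would rather keep one row per original row and pass the column positions directly, as the question asks, a different approach is to give every column in a group the same categorical dtype before calling get_dummies. This is only a sketch and not what the code above does: the helper name one_hot_shared is made up for illustration, and it assumes there are no missing values in the grouped columns.

```python
import pandas as pd

df = pd.DataFrame(
    data=[["France", "Bupapest", "Sweden", "Paris"],
          ["Italy", "Frankfurt", "France", "Naples"]],
    columns=["Countries 1", "Cities 1", "Countries 2", "Cities 2"],
)

def one_hot_shared(frame, groups):
    """One-hot encode frame; columns in the same group share categories.

    groups is a list of lists of column positions, e.g. [[0, 2], [1, 3]].
    Hypothetical helper, assumes no NaN in the grouped columns.
    """
    pieces = []
    for positions in groups:
        cols = frame.columns[positions]
        # Union of all values seen in this group, sorted for a stable order.
        categories = sorted(pd.unique(frame[cols].to_numpy().ravel()))
        dtype = pd.CategoricalDtype(categories=categories)
        for col in cols:
            # get_dummies emits a column for every category of the dtype,
            # including categories that never occur in this column.
            pieces.append(pd.get_dummies(frame[col].astype(dtype), prefix=col))
    return pd.concat(pieces, axis=1)

encoded = one_hot_shared(df, [[0, 2], [1, 3]])
print(encoded)
```

Each countries column then gets 3 dummy columns and each cities column gets 4, so a column such as "Countries 1_Sweden" exists even though Sweden only appears in Countries 2.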
Upvotes: 2