Reputation: 23
I have imported a json file and I now have a data frame where one column (code) that is a list.
index year gvkey code
0 1998 15686 ['TAX', 'ENVR', 'HEALTH']
1 2005 15372 ['EDUC', 'TAX', 'HEALTH', 'JUST']
2 2001 27486 ['LAB', 'TAX', 'HEALTH']
3 2008 84967 ['HEALTH','LAB', 'JUST']
What I want to get is something as follow:
index year gvkey TAX ENVR HEALTH EDUC JUST LAB
0 1998 15686 1 1 1 0 0 0
1 2005 15372 1 0 1 0 1 0
2 2001 27486 1 0 1 0 1 0
3 2008 84967 0 0 1 0 1 1
Following Pandas convert a column of list to dummies I tried the following code (where df is my data frame):
s = pd.Series(df["code"])
l = pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)
I get the second part of the data right (variables TAX, ENVR, HEALTH, EDUC, JUST and LAB), but loose the first (year and gvkey).
How can I keep the year and gvkey variable?
Upvotes: 1
Views: 393
Reputation: 862481
I think better solution here is use DataFrame.pop
with Series.str.join
and Series.str.get_dummies
:
df = df.join(df.pop('code').str.join('|').str.get_dummies())
print (df)
year gvkey EDUC ENVR HEALTH JUST LAB TAX
index
0 1998 15686 0 1 1 0 0 1
1 2005 15372 1 0 1 1 0 1
2 2001 27486 0 0 1 0 1 1
3 2008 84967 0 0 1 1 1 0
If performance is important use MultiLabelBinarizer
:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df1 = pd.DataFrame(mlb.fit_transform(df.pop('code')),columns=mlb.classes_)
df = df.join(df1)
print (df)
year gvkey EDUC ENVR HEALTH JUST LAB TAX
index
0 1998 15686 0 1 1 0 0 1
1 2005 15372 1 0 1 1 0 1
2 2001 27486 0 0 1 0 1 1
3 2008 84967 0 0 1 1 1 0
Your solution is possible, but slow, so better avoid it, also sum
working only for unique values, for general solution need max
:
df = df.join(pd.get_dummies(df.pop('code').apply(pd.Series).stack()).max(level=0))
print (df)
year gvkey EDUC ENVR HEALTH JUST LAB TAX
index
0 1998 15686 0 1 1 0 0 1
1 2005 15372 1 0 1 1 0 1
2 2001 27486 0 0 1 0 1 1
3 2008 84967 0 0 1 1 1 0
Upvotes: 5
Reputation: 75080
You can do this by below methods:
Method 1: Convert the column to a dataframe and get dummies , then groupby
on axis=1
and get max:
m = pd.get_dummies(pd.DataFrame(df['code'].tolist())).groupby(lambda x:
x.split('_')[1],axis=1).max()
final1 = df.drop('code',1).assign(**m)
Method 2: Join the list of columns with a |
and use series.str.get_dummies
final2 = df.drop('code',1).assign(**df['code'].str.join('|').str.get_dummies())
Method 3: Your method with concat
s = pd.Series(df["code"])
l = pd.get_dummies(s.apply(pd.Series).stack()).max(level=0)
final3 = pd.concat((df.drop('code',1),l),axis=1)
#or final = df.drop('code',1).assign(**l)
Upvotes: 2