Reputation: 861
I have a large dataframe (‘data’) made up of one column. Each row in the column is made of a string and each string is made up of comma separated categories. I wish to one hot encode this data.
For example,
data = {"mesh": ["A, B, C", "C,B", ""]}
From this I would like to get a dataframe consisting of:
index A B. C
0 1 1 1
1 0 1 1
2 0 0 0
How can I do this?
Upvotes: 17
Views: 5178
Reputation: 89
If categories are controlled (you know how many and who they are), best answer is by @Tejeshar Gurram. But, what if you have lots of potencial categories and you are not interested in all of them. Say:
s = pd.Series(['A,B,C,', 'B,C,D', np.nan, 'X,W,Z'])
0 A,B,C,
1 B,C,D
2 NaN
3 X,W,Z
dtype: object
If you are only interested in categories B and C for the final df of dummies, I've found this workaround does the job:
cat_list = ['B', 'C']
list_of_lists = [ (s.str.contains(cat_, regex=False)==True).astype(bool).astype(int).to_list() for cat_ in cat_list]
data = {k:v for k,v in zip(cat_list,list_of_lists)}
pd.DataFrame(data)
B C
0 1 0
1 0 1
2 0 0
3 0 0
Upvotes: 0
Reputation: 403050
Note that you're not dealing with OHEs.
str.split
+ stack
+ get_dummies
+ sum
df = pd.DataFrame(data)
df
mesh
0 A, B, C
1 C,B
2
(df.mesh.str.split('\s*,\s*', expand=True)
.stack()
.str.get_dummies()
.sum(level=0))
df
A B C
0 1 1 1
1 0 1 1
2 0 0 0
apply
+ value_counts
(df.mesh.str.split(r'\s*,\s*', expand=True)
.apply(pd.Series.value_counts, 1)
.iloc[:, 1:]
.fillna(0, downcast='infer'))
A B C
0 1 1 1
1 0 1 1
2 0 0 0
pd.crosstab
x = df.mesh.str.split('\s*,\s*', expand=True).stack()
pd.crosstab(x.index.get_level_values(0), x.values).iloc[:, 1:]
df
col_0 A B C
row_0
0 1 1 1
1 0 1 1
2 0 0 0
Upvotes: 24
Reputation: 124
Figured there is a simpler answer, or I felt this as more simple compared to multiple operations that we have to make.
Make sure the column has unique values separated be commas
Use get dummies in built parameter to specify the separator as comma. The default for this is pipe separated.
data = {"mesh": ["A, B, C", "C,B", ""]}
sof_df=pd.DataFrame(data)
sof_df.mesh=sof_df.mesh.str.replace(' ','')
sof_df.mesh.str.get_dummies(sep=',')
A B C
0 1 1 1
1 0 1 1
2 0 0 0
Upvotes: 8