scutnex
scutnex

Reputation: 861

Convert pandas DataFrame column of comma separated strings to one-hot encoded

I have a large dataframe (‘data’) made up of one column. Each row in the column is made of a string and each string is made up of comma separated categories. I wish to one hot encode this data.

For example,

data = {"mesh": ["A, B, C", "C,B", ""]}

From this I would like to get a dataframe consisting of:

index      A       B.     C
0          1       1      1
1          0       1      1
2          0       0      0

How can I do this?

Upvotes: 17

Views: 5178

Answers (3)

Edu Marín
Edu Marín

Reputation: 89

If categories are controlled (you know how many and who they are), best answer is by @Tejeshar Gurram. But, what if you have lots of potencial categories and you are not interested in all of them. Say:

s = pd.Series(['A,B,C,', 'B,C,D', np.nan, 'X,W,Z'])
 
0    A,B,C,
1     B,C,D
2       NaN
3     X,W,Z
dtype: object

If you are only interested in categories B and C for the final df of dummies, I've found this workaround does the job:

cat_list = ['B', 'C']
list_of_lists = [ (s.str.contains(cat_, regex=False)==True).astype(bool).astype(int).to_list()  for cat_ in cat_list]
data = {k:v for k,v in zip(cat_list,list_of_lists)}
pd.DataFrame(data)

   B  C
0  1  0
1  0  1
2  0  0
3  0  0

Upvotes: 0

cs95
cs95

Reputation: 403050

Note that you're not dealing with OHEs.

str.split + stack + get_dummies + sum

df = pd.DataFrame(data)
df

      mesh
0  A, B, C
1      C,B
2         

(df.mesh.str.split('\s*,\s*', expand=True)
   .stack()
   .str.get_dummies()
   .sum(level=0))
df

   A  B  C
0  1  1  1
1  0  1  1
2  0  0  0

apply + value_counts

(df.mesh.str.split(r'\s*,\s*', expand=True)
   .apply(pd.Series.value_counts, 1)
   .iloc[:, 1:]
   .fillna(0, downcast='infer'))

   A  B  C
0  1  1  1
1  0  1  1
2  0  0  0

pd.crosstab

x = df.mesh.str.split('\s*,\s*', expand=True).stack()
pd.crosstab(x.index.get_level_values(0), x.values).iloc[:, 1:]
df

col_0  A  B  C
row_0         
0      1  1  1
1      0  1  1
2      0  0  0

Upvotes: 24

Tejeshwar Gurram
Tejeshwar Gurram

Reputation: 124

Figured there is a simpler answer, or I felt this as more simple compared to multiple operations that we have to make.

  1. Make sure the column has unique values separated be commas

  2. Use get dummies in built parameter to specify the separator as comma. The default for this is pipe separated.

    data = {"mesh": ["A, B, C", "C,B", ""]}
    sof_df=pd.DataFrame(data)
    sof_df.mesh=sof_df.mesh.str.replace(' ','')
    sof_df.mesh.str.get_dummies(sep=',')
    

OUTPUT:

    A   B   C
0   1   1   1
1   0   1   1
2   0   0   0

Upvotes: 8

Related Questions