One hot encoding from a dataframe string column with multiple values

Question

I have a dataframe "df1" composed of 1245 lines, with a column text (object type) and topic (object type). The topic column contain different number corresponding to text label. Here is an exemple :

        text                                                topic
1207    June 2019: The French Facility for Global Envi...   3 12 7
1208    May 2019: Participants from multi-stakeholder ...   8
1209    2 July 2019: UN Member States have reached agr...   1 7
1210    30 June 2019: The G20 Leaders’ Summit and asso...   7 8 9 11 12 13 14 15 17

I would like to obtain an one hot encoding form like that (also add an 'S' before number in column name) :

        text                                                S1  S2  S3 ..... S7  S8 S9 etc.
1207    June 2019: The French Facility for Global Envi...   0    0   1  ..... 1   0  0
1208    May 2019: Participants from multi-stakeholder ...   0    0   0 ...... 0   1  0
1209    2 July 2019: UN Member States have reached agr...   1    0   0  ..... 1   0  0
1210    30 June 2019: The G20 Leaders’ Summit and asso...   0    0   0  ......1   1  1

The 'difficulty' here is that my text is multilabeled so code for simple one-hot-encoding do not works for my case. Any idea ?

Leon Markes · Accepted Answer

With only using pandas, you can do something like this:

import pandas as pd


data = [['June 2019: The French Facility for Global Envi...', '3 12 7'],
       ['May 2019: Participants from multi-stakeholder ...','8'],
       ['2 July 2019: UN Member States have reached agr...','1 7'],
       ['30 June 2019: The G20 Leaders’ Summit and asso...','7 8 9 11 12 13 14 15 17']]
df = pd.DataFrame(data , columns=['text', 'topic'])

# creating list of strings where each value is one number out of topic column
unique_values = ' '.join(df['topic'].values.tolist()).split(' ')

# creating new column for each value in unique_values
for number in unique_values:
    df[f'S{number}'] = 0
    
# changing 0 to 1 for every Snumber column where topic contains number
for idx, row in df.iterrows():
    for number in row['topic'].split(' '):
        df.loc[idx, f'S{number}'] = 1
df.drop('topic', axis=1, inplace=True)

Result:


    text                                                S3  S12 S7  S8  S1  S9  S11 S13 S14 S15 S17
0   June 2019: The French Facility for Global Envi...   1   1   1   0   0   0   0   0   0   0   0
1   May 2019: Participants from multi-stakeholder ...   0   0   0   1   0   0   0   0   0   0   0
2   2 July 2019: UN Member States have reached agr...   0   0   1   0   1   0   0   0   0   0   0
3   30 June 2019: The G20 Leaders’ Summit and asso...   0   1   1   1   0   1   1   1   1   1   1

One hot encoding from a dataframe string column with multiple values

Answers (2)

Related Questions