JEG
JEG

Reputation: 154

One hot encoding from a dataframe string column with multiple values

I have a dataframe "df1" composed of 1245 lines, with a column text (object type) and topic (object type). The topic column contain different number corresponding to text label. Here is an exemple :

        text                                                topic
1207    June 2019: The French Facility for Global Envi...   3 12 7
1208    May 2019: Participants from multi-stakeholder ...   8
1209    2 July 2019: UN Member States have reached agr...   1 7
1210    30 June 2019: The G20 Leaders’ Summit and asso...   7 8 9 11 12 13 14 15 17

I would like to obtain an one hot encoding form like that (also add an 'S' before number in column name) :

        text                                                S1  S2  S3 ..... S7  S8 S9 etc.
1207    June 2019: The French Facility for Global Envi...   0    0   1  ..... 1   0  0
1208    May 2019: Participants from multi-stakeholder ...   0    0   0 ...... 0   1  0
1209    2 July 2019: UN Member States have reached agr...   1    0   0  ..... 1   0  0
1210    30 June 2019: The G20 Leaders’ Summit and asso...   0    0   0  ......1   1  1

The 'difficulty' here is that my text is multilabeled so code for simple one-hot-encoding do not works for my case. Any idea ?

Upvotes: 0

Views: 826

Answers (2)

Albo
Albo

Reputation: 1644

With a slightly modified data (for readability reasons...):

from io import StringIO
import pandas as pd

s = """id,text,topic
1207,One,1 2 5
1208,Two,3
1209,Three,1 4
1210,Four,1 2 3"""

df = pd.read_csv(StringIO(s))
df.topic = df.topic.str.split(' ').apply(lambda x: [int(y) for y in x])

b = np.zeros((df.topic.size, max(max(x) for x in df.topic) + 1))
for i in df.index:
    b[i, df.topic[i]] = 1

idx = {'id': df.id, 'text': df.text}
idx.update({f'S{i}': b[:, i] for i in range(1, b.shape[1])})
idx
df = pd.DataFrame(idx)
print(df.set_index('id').to_markdown())

this gives you:

|   id | text   |   S1 |   S2 |   S3 |   S4 |   S5 |
|-----:|:-------|-----:|-----:|-----:|-----:|-----:|
| 1207 | One    |    1 |    1 |    0 |    0 |    1 |
| 1208 | Two    |    0 |    0 |    1 |    0 |    0 |
| 1209 | Three  |    1 |    0 |    0 |    1 |    0 |
| 1210 | Four   |    1 |    1 |    1 |    0 |    0 |

Upvotes: 1

Leon Markes
Leon Markes

Reputation: 442

With only using pandas, you can do something like this:

import pandas as pd


data = [['June 2019: The French Facility for Global Envi...', '3 12 7'],
       ['May 2019: Participants from multi-stakeholder ...','8'],
       ['2 July 2019: UN Member States have reached agr...','1 7'],
       ['30 June 2019: The G20 Leaders’ Summit and asso...','7 8 9 11 12 13 14 15 17']]
df = pd.DataFrame(data , columns=['text', 'topic'])

# creating list of strings where each value is one number out of topic column
unique_values = ' '.join(df['topic'].values.tolist()).split(' ')

# creating new column for each value in unique_values
for number in unique_values:
    df[f'S{number}'] = 0
    
# changing 0 to 1 for every Snumber column where topic contains number
for idx, row in df.iterrows():
    for number in row['topic'].split(' '):
        df.loc[idx, f'S{number}'] = 1
df.drop('topic', axis=1, inplace=True)

Result:


    text                                                S3  S12 S7  S8  S1  S9  S11 S13 S14 S15 S17
0   June 2019: The French Facility for Global Envi...   1   1   1   0   0   0   0   0   0   0   0
1   May 2019: Participants from multi-stakeholder ...   0   0   0   1   0   0   0   0   0   0   0
2   2 July 2019: UN Member States have reached agr...   0   0   1   0   1   0   0   0   0   0   0
3   30 June 2019: The G20 Leaders’ Summit and asso...   0   1   1   1   0   1   1   1   1   1   1

Upvotes: 1

Related Questions