Reputation: 154
I have a dataframe "df1" composed of 1245 lines, with a column text (object type) and topic (object type). The topic column contain different number corresponding to text label. Here is an exemple :
text topic
1207 June 2019: The French Facility for Global Envi... 3 12 7
1208 May 2019: Participants from multi-stakeholder ... 8
1209 2 July 2019: UN Member States have reached agr... 1 7
1210 30 June 2019: The G20 Leaders’ Summit and asso... 7 8 9 11 12 13 14 15 17
I would like to obtain an one hot encoding form like that (also add an 'S' before number in column name) :
text S1 S2 S3 ..... S7 S8 S9 etc.
1207 June 2019: The French Facility for Global Envi... 0 0 1 ..... 1 0 0
1208 May 2019: Participants from multi-stakeholder ... 0 0 0 ...... 0 1 0
1209 2 July 2019: UN Member States have reached agr... 1 0 0 ..... 1 0 0
1210 30 June 2019: The G20 Leaders’ Summit and asso... 0 0 0 ......1 1 1
The 'difficulty' here is that my text is multilabeled so code for simple one-hot-encoding do not works for my case. Any idea ?
Upvotes: 0
Views: 826
Reputation: 1644
With a slightly modified data (for readability reasons...):
from io import StringIO
import pandas as pd
s = """id,text,topic
1207,One,1 2 5
1208,Two,3
1209,Three,1 4
1210,Four,1 2 3"""
df = pd.read_csv(StringIO(s))
df.topic = df.topic.str.split(' ').apply(lambda x: [int(y) for y in x])
b = np.zeros((df.topic.size, max(max(x) for x in df.topic) + 1))
for i in df.index:
b[i, df.topic[i]] = 1
idx = {'id': df.id, 'text': df.text}
idx.update({f'S{i}': b[:, i] for i in range(1, b.shape[1])})
idx
df = pd.DataFrame(idx)
print(df.set_index('id').to_markdown())
this gives you:
| id | text | S1 | S2 | S3 | S4 | S5 |
|-----:|:-------|-----:|-----:|-----:|-----:|-----:|
| 1207 | One | 1 | 1 | 0 | 0 | 1 |
| 1208 | Two | 0 | 0 | 1 | 0 | 0 |
| 1209 | Three | 1 | 0 | 0 | 1 | 0 |
| 1210 | Four | 1 | 1 | 1 | 0 | 0 |
Upvotes: 1
Reputation: 442
With only using pandas, you can do something like this:
import pandas as pd
data = [['June 2019: The French Facility for Global Envi...', '3 12 7'],
['May 2019: Participants from multi-stakeholder ...','8'],
['2 July 2019: UN Member States have reached agr...','1 7'],
['30 June 2019: The G20 Leaders’ Summit and asso...','7 8 9 11 12 13 14 15 17']]
df = pd.DataFrame(data , columns=['text', 'topic'])
# creating list of strings where each value is one number out of topic column
unique_values = ' '.join(df['topic'].values.tolist()).split(' ')
# creating new column for each value in unique_values
for number in unique_values:
df[f'S{number}'] = 0
# changing 0 to 1 for every Snumber column where topic contains number
for idx, row in df.iterrows():
for number in row['topic'].split(' '):
df.loc[idx, f'S{number}'] = 1
df.drop('topic', axis=1, inplace=True)
Result:
text S3 S12 S7 S8 S1 S9 S11 S13 S14 S15 S17
0 June 2019: The French Facility for Global Envi... 1 1 1 0 0 0 0 0 0 0 0
1 May 2019: Participants from multi-stakeholder ... 0 0 0 1 0 0 0 0 0 0 0
2 2 July 2019: UN Member States have reached agr... 0 0 1 0 1 0 0 0 0 0 0
3 30 June 2019: The G20 Leaders’ Summit and asso... 0 1 1 1 0 1 1 1 1 1 1
Upvotes: 1