Nabih Bawazir

Reputation: 7255

How do I do word tokenisation in a pandas DataFrame?

This is my data

No  Text                    
1   You are smart
2   You are beautiful

My expected output

No  Text                   You    are  smart  beautiful                 
1   You are smart            1      1      1          0
2   You are beautiful        1      1      0          1

Upvotes: 1

Views: 78

Answers (1)

jezrael

Reputation: 862406

For an nltk solution, use word_tokenize to get each row's list of words, then MultiLabelBinarizer, and finally join the result back to the original DataFrame:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from nltk import word_tokenize  # requires the 'punkt' tokenizer data: nltk.download('punkt')

mlb = MultiLabelBinarizer()
# tokenize each row's text into a list of words
s = df.apply(lambda row: word_tokenize(row['Text']), axis=1)
# one-hot encode the token lists and join the indicator columns back to the original frame
df = df.join(pd.DataFrame(mlb.fit_transform(s), columns=mlb.classes_, index=df.index))
print(df)
   No               Text  You  are  beautiful  smart
0   1      You are smart    1    1          0      1
1   2  You are beautiful    1    1          1      0

For a pure pandas solution, use str.get_dummies + join:

df = df.join(df['Text'].str.get_dummies(sep=' '))
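
As a quick self-contained check of the pure pandas route, here is a minimal sketch assuming the same two-row sample frame from the question:

import pandas as pd

df = pd.DataFrame({'No': [1, 2], 'Text': ['You are smart', 'You are beautiful']})
# split each string on whitespace and one-hot encode every distinct word into its own column
df = df.join(df['Text'].str.get_dummies(sep=' '))
print(df)
#    No               Text  You  are  beautiful  smart
# 0   1      You are smart    1    1          0      1
# 1   2  You are beautiful    1    1          1      0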

Upvotes: 3
