Reputation: 1614
I have a DataFrame with a column 'description', and I would like to build a one-hot-style encoding that holds the count of each word in the description:
                       description
0  test words that describe things
1         more and more words here
2                      things test
Desired output:
   test  words  that  describe  things  more  here  and
0   1.0    1.0   1.0       1.0     1.0   0.0   0.0  0.0
1   0.0    1.0   0.0       0.0     0.0   2.0   1.0  1.0
2   1.0    0.0   0.0       0.0     1.0   0.0   0.0  0.0
The current solution I have is:
one_hot = df.apply(lambda x: pd.Series(x.description).str.split(expand=True).stack().value_counts(), axis=1)
This is very slow (2.6 ms per row) on a large dataset (130K rows), and I was wondering whether there is a better solution. I would also like to remove words that show up in only one entry. For the example above, that would leave:
   test  words  things
0   1.0    1.0     1.0
1   0.0    1.0     0.0
2   1.0    0.0     1.0
Upvotes: 3
Views: 908
Reputation: 75130
IIUC, for counts you can do a groupby + sum on axis=1 after get_dummies:
# get_dummies names columns '<position>_<word>'; split once so words containing '_' survive
final = (pd.get_dummies(df['description'].str.split(expand=True))
           .groupby(lambda x: x.split('_', 1)[-1], axis=1).sum())
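For anyone who wants to try this on the sample data from the question, here is a minimal self-contained sketch. One assumption on my side: recent pandas releases deprecate `groupby(..., axis=1)`, so this version groups the transposed frame by row label instead and transposes back. The names `tokens`, `dummies` and `counts` are just illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"description": [
    "test words that describe things",
    "more and more words here",
    "things test",
]})

# One column per token position; shorter rows are padded with missing values.
tokens = df["description"].str.split(expand=True)

# get_dummies names its columns "<position>_<word>", e.g. "0_test".
dummies = pd.get_dummies(tokens)

# Sum the indicator columns that share a word. Grouping the transposed frame
# by its row labels stands in for groupby(axis=1) (assumption: newer pandas).
counts = (
    dummies.T
    .groupby(lambda name: name.split("_", 1)[-1])
    .sum()
    .T
)

print(counts)
```

The result has one row per description and one column per distinct word, sorted alphabetically because groupby sorts its keys by default.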
Or with apply (slower):
# pd.Series.value_counts replaces the top-level pd.value_counts, deprecated in pandas 2.1
df['description'].str.split(expand=True).apply(pd.Series.value_counts, axis=1).fillna(0)
   and  describe  here  more  test  that  things  words
0    0         1     0     0     1     1       1      1
1    1         0     1     2     0     0       0      1
2    0         0     0     0     1     0       1      0
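The snippets above don't cover the second part of the question, dropping words that appear in only one entry. A rough sketch of one way to do it: compute each word's document frequency (the number of rows where its count is non-zero) and keep only the columns where that is greater than 1. To keep the example self-contained, the counts here are built with `collections.Counter`, which is a swapped-in alternative and not the get_dummies approach above; I have not timed it against the 130K-row dataset, so treat any speed difference as something to measure yourself.

```python
from collections import Counter

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"description": [
    "test words that describe things",
    "more and more words here",
    "things test",
]})

# Count words per row; each Counter becomes one row of the frame, and words
# missing from a row show up as NaN, which is then filled with 0.
counts = (
    pd.DataFrame(
        [Counter(text.split()) for text in df["description"]],
        index=df.index,
    )
    .fillna(0)
    .astype(int)
)

# Document frequency: number of rows in which each word occurs at least once.
doc_freq = (counts > 0).sum(axis=0)

# Keep only the words that occur in more than one row.
filtered = counts.loc[:, doc_freq > 1]

print(filtered)
```

On the three sample rows this keeps only test, words and things, which matches the reduced table in the question. The same `doc_freq` filter should work unchanged on the output of either snippet above, since it only depends on the count matrix.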
Upvotes: 3