Reputation: 8669
I have a dataframe with sentences and a dictionary of terms grouped into topics, and I want to count the number of term matches for each topic.
import pandas as pd
terms = {'animals': ["fox", "deer", "eagle"],
         'people': ['John', 'Rob', 'Steve'],
         'games': ['basketball', 'football', 'hockey']
         }
df = pd.DataFrame({
    'Score': [4, 6, 2, 7, 8],
    'Foo': ['The quick brown fox was playing basketball today',
            'John and Rob visited the eagles nest, the foxes ran away',
            'Bill smells like a wet dog',
            'Steve threw the football at a deer. But the football missed',
            'Sheriff John does not like hockey']
})
So far I have created a column for each topic and marked it with 1 if any of that topic's words is present, by iterating over the dictionary:
df = pd.concat([df, pd.DataFrame(columns=list(terms.keys()))])
for k, v in terms.items():
    for val in v:
        df.loc[df.Foo.str.contains(val), k] = 1
print(df)
and I get:
                                                 Foo  Score animals games people
0   The quick brown fox was playing basketball today      4       1     1    NaN
1  John and Rob visited the eagles nest, the foxe...      6       1   NaN      1
2                         Bill smells like a wet dog      2     NaN   NaN    NaN
3  Steve threw the football at a deer. But the fo...      7       1     1      1
4                  Sheriff John does not like hockey      8     NaN     1      1
What is the best way to count the number of words from each topic that appear in each sentence? And is there a more efficient way of looping over the dictionary, short of using Cython?
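For reference, one vectorized way to get per-topic counts directly is Series.str.count with an alternation pattern (a minimal sketch using the df and terms defined above; like str.contains, it matches substrings, so "eagle" also counts "eagles"):
# Count how many topic terms occur in each sentence, one column per topic.
counts = pd.DataFrame({k: df.Foo.str.count('|'.join(v)) for k, v in terms.items()})
print(counts)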
Upvotes: 3
Views: 3664
Reputation: 863226
You can use split with stack, which is about 5 times faster than the Counter solution:
df1 = df.Foo.str.split(expand=True).stack() \
        .reset_index(level=1, drop=True) \
        .reset_index(name='Foo')
for k, v in terms.items():
    df1[k] = df1.Foo.str.contains('|'.join(v))
# print(df1)
print(df1.groupby('index').sum().astype(int))
       games  animals  people
index
0          1        1       0
1          0        2       2
2          0        0       0
3          2        1       1
4          1        0       1
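If the counts are wanted next to the original Score and Foo columns, they can be joined back on the row index (a minimal sketch; it assumes df does not already carry the animals/games/people indicator columns from the question's loop):
# Aggregate the per-word boolean flags into per-sentence counts,
# then attach them to the original frame by row index.
counts = df1.groupby('index')[list(terms)].sum().astype(int)
print(df.join(counts))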
Timings:
In [233]: %timeit a(df)
100 loops, best of 3: 4.9 ms per loop
In [234]: %timeit b(df)
10 loops, best of 3: 25.2 ms per loop
Code:
def a(df):
    # split/stack solution: one row per word, then a boolean match flag per topic
    df1 = df.Foo.str.split(expand=True).stack().reset_index(level=1, drop=True).reset_index(name='Foo')
    for k, v in terms.items():
        df1[k] = df1.Foo.str.contains('|'.join(v))
    return df1.groupby('index').sum().astype(int)

def b(df):
    # Counter/replace solution: map each term to its count in the sentence, zero out the rest
    from collections import Counter
    df1 = pd.DataFrame(terms)
    res = []
    for i, r in df.iterrows():
        s = df1.replace(Counter(r['Foo'].split())).replace(r'\w', 0, regex=True).sum()
        res.append(pd.DataFrame(s).T)
    return pd.concat(res)
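Outside IPython, the %timeit comparison can be reproduced roughly with the standard timeit module (a sketch, assuming a, b, df and terms above are defined):
import timeit
# Total time for 100 runs of each function; divide by 100 for a per-call figure.
print(timeit.timeit(lambda: a(df), number=100))   # split/stack
print(timeit.timeit(lambda: b(df), number=100))   # Counter/replace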
Upvotes: 2
Reputation: 31181
I would go for Counter and replace:
from collections import Counter

df1 = pd.DataFrame(terms)
res = []
for i, r in df.iterrows():
    # replace each term with its count in the sentence, then zero out unmatched terms
    s = df1.replace(Counter(r['Foo'].split())).replace(r'\w', 0, regex=True).sum()
    res.append(pd.DataFrame(s).T)
In [109]: pd.concat(res)
Out[109]:
   animals  games  people
0        1      1       0
0        0      0       2
0        0      0       0
0        0      2       1
0        0      1       1
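The row labels are all 0 because each per-sentence frame keeps its own index; passing ignore_index restores labels that line up with df (a small sketch):
# Re-number the rows 0..4 so they match the original df's index.
print(pd.concat(res, ignore_index=True))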
Upvotes: 1