Reputation: 69
I have 100 keywords in df1,and 10,000 articles in df2. I want to calculate how many articles contain a certain keyword. For example, there are about 20 articles contain keyword "apple".
I try to use df.str.contains(),but I have to calculate every keyword. Can you tell me a efficient way?
df1=pd.DataFrame(['apple','mac','pc','ios','lg'],columns=['keywords'])
df2=pd.DataFrame(['apple is good for health','mac is another pc','today is sunday','Star wars pc game','ios is a system,lg is not','lg is a japan company '],columns=['article'])
result:
1 artricl contain "apple"
1 article contain 'mac'
2 article contain 'pc'
1 article contain "ios"
2 article contain 'lg'
Upvotes: 1
Views: 56
Reputation: 863701
I think need str.contains
for boolean Series with sum
for count True
s, which are processes like 1
s, for all keywords
use list comprehension
with DataFrame
contructor:
L = [(x, df2['article'].str.contains(x).sum()) for x in df1['keywords']]
#alternative solution
#L = [(x, sum(x in article for article in df2['article'])) for x in df1['keywords']]
df3 = pd.DataFrame(L, columns=['keyword', 'count'])
print (df3)
keyword count
0 apple 1
1 mac 1
2 pc 2
3 ios 1
4 lg 2
If want only print output:
for x in df1['keywords']:
count = df2['article'].str.contains(x).sum()
#another solution if no NaNs with sum, generator and check membership by in
#count = sum(x in article for article in df2['article'])
print ('{} article contain "{}"'.format(count, x))
1 article contain "apple"
1 article contain "mac"
2 article contain "pc"
1 article contain "ios"
2 article contain "lg"
Upvotes: 2