Andy Adam
Andy Adam

Reputation: 69

Pandas:check values in column A contained in column B

I have 100 keywords in df1,and 10,000 articles in df2. I want to calculate how many articles contain a certain keyword. For example, there are about 20 articles contain keyword "apple".

I try to use df.str.contains(),but I have to calculate every keyword. Can you tell me a efficient way?

df1=pd.DataFrame(['apple','mac','pc','ios','lg'],columns=['keywords'])


df2=pd.DataFrame(['apple is good for health','mac is another pc','today is sunday','Star wars pc game','ios is a system,lg is not','lg is a japan company '],columns=['article'])

result:

1 artricl contain "apple"
1 article contain 'mac'
2 article contain 'pc'
1 article contain "ios"
2 article contain 'lg'

Upvotes: 1

Views: 56

Answers (1)

jezrael
jezrael

Reputation: 863701

I think need str.contains for boolean Series with sum for count Trues, which are processes like 1s, for all keywords use list comprehension with DataFrame contructor:

L = [(x, df2['article'].str.contains(x).sum()) for x in df1['keywords']]
#alternative solution
#L = [(x, sum(x in article for article in df2['article'])) for x in df1['keywords']]
df3 = pd.DataFrame(L, columns=['keyword', 'count'])
print (df3)
  keyword  count
0   apple      1
1     mac      1
2      pc      2
3     ios      1
4      lg      2

If want only print output:

for x in df1['keywords']:
    count =  df2['article'].str.contains(x).sum()
    #another solution if no NaNs with sum, generator and check membership by in
    #count =  sum(x in article for article in df2['article'])
    print ('{} article contain "{}"'.format(count, x))

1 article contain "apple"
1 article contain "mac"
2 article contain "pc"
1 article contain "ios"
2 article contain "lg"

Upvotes: 2

Related Questions