Reputation: 297
I'm trying to get the count of words (strings) in a pandas dataframe column that are found in another column in any order.
I've tried the following, which is close, but it doesn't count the occurrences (It only tells me if the words were found in any order).
words='|'.join(df['Cluster Name'].unique())
df['frequency']=df['Keyword'].str.contains(words).astype(int)
Minimum Reproducible Example:
data = {'Keyword' : ['Nike', 'Nike Socks', 'Nike Stripy Socks', 'Socks Nike', 'Adidas Socks'],
'Cluster' : ['Nike Socks', 'Nike Socks', 'Nike Socks', 'Nike Socks', 'Nike Socks']}
# Create DataFrame
df = pd.DataFrame(data)
expected output
Keyword Cluster Frequency
0 Nike Nike Socks 1
1 Nike Socks Nike Socks 2
2 Nike Stripy Socks Nike Socks 2
3 Socks Nike Nike Socks 2
4 Adidas Socks Nike Socks 1
Upvotes: 2
Views: 413
Reputation: 260300
You're going to have to use apply
anyway, so you can directly use set
intersection:
df['Frequency'] = df.apply(lambda x: len(set(x['Keyword'].split()).intersection(x['Cluster'].split())), axis=1)
Output:
Keyword Cluster Frequency
0 Nike Nike Socks 1
1 Nike Socks Nike Socks 2
2 Nike Stripy Socks Nike Socks 2
3 Socks Nike Nike Socks 2
4 Adidas Socks Nike Socks 1
Upvotes: 1
Reputation: 323226
We can explode
then count the occurrence for word , then sum
the back
x = df.assign(Keyword = df.Keyword.str.split(' ')).explode('Keyword')
df['freq'] = x.apply(lambda y : y['Keyword'] in y['Cluster'],axis=1).groupby(level=0).sum()
df
Keyword Cluster freq
0 Nike Nike Socks 1
1 Nike Socks Nike Socks 2
2 Nike Stripy Socks Nike Socks 2
3 Socks Nike Nike Socks 2
4 Adidas Socks Nike Socks 1
Upvotes: 1
Reputation: 735
My answer is similar to @Derek, but it will work properly if the words in Cluster
column are separated not only with whitespases
from re import findall
import pandas as pd
def count_corresponding(row):
keywords = row.Keyword.split(' ')
count = sum([len(findall(keyword,row.Cluster)) for keyword in keywords])
return count
data = {'Keyword' : ['Nike', 'Nike Socks', 'Nike Stripy Socks', 'Socks Nike', 'Adidas Socks'],
'Cluster' : ['Nike Socks', 'Nike Socks', 'Nike Socks', 'Nike Socks', 'Nike Socks']}
df = pd.DataFrame(data)
df['Frequency'] = df.apply(count_corresponding, axis=1)
Upvotes: 1
Reputation: 19545
You can create a custom function that takes a row as an input, and then apply
it to the dataframe rowwise using the argument axis=1
:
def count_keywords(row):
freq = 0
for word in row['Keyword'].split(" "):
if word in row['Cluster']:
freq += 1
return freq
df['Frequency'] = df.apply(lambda row: count_keywords(row), axis=1)
Output:
>>> df
Keyword Cluster Frequency
0 Nike Nike Socks 1
1 Nike Socks Nike Socks 2
2 Nike Stripy Socks Nike Socks 2
3 Socks Nike Nike Socks 2
4 Adidas Socks Nike Socks 1
Upvotes: 1