ceoneill
ceoneill

Reputation: 11

Count instances of string in separate dataframe, not exact match, using Python, pandas

I have a list of names in df1 and I need to see if they match anywhere in df2. I know I probably need to use str.contains on each item and add one to a count, but I haven't figured out how to do this successfully.

for e in df2['People_separate']:
count = df1['People'].str.contains(e)
if count == True:
    count += 1
return count

example: df1:

| People    | 
| --------  | 
| A B / E F | 
| A B / C D | 
| E F       |

df2 (looking to populate the 'counts' column:

| People_separate | Counts |
| --------------- | -------------|
| A B             | 2            |
| C D             | 1            |
| E F             | 2            |

Upvotes: 1

Views: 198

Answers (2)

ALollz
ALollz

Reputation: 59579

If the "not exact match" is really a requirement, then we form a search pattern to use with Series.str.extractall, and take the value_counts of that extraction. This way if your search word is 'foo' a word like 'foobar' will still count as a match (because it contains 'foo').

The reindex ensures the resulting Series also shows 0s for words that never matched.

import pandas as pd

df1 = pd.DataFrame({"People": ['A B / E F', 'A B / C D', 'E F']})
df2 = pd.DataFrame({"People_separate": ['A B', 'C D', 'E F', 'banana']})

pat = '(' + '|'.join(df2['People_separate']) + ')'
#(A B|C D|E F)

(df1['People'].str.extractall(pat)[0]
   .value_counts()
   .reindex(df2['People_separate'], fill_value=0))

People_separate
A B       2
C D       1
E F       2
banana    0
Name: 0, dtype: int64

Upvotes: 1

perl
perl

Reputation: 9941

You can split the rows by ' / ' with split, then explode to convert lists into rows, and then count values with value_counts:

df['People'].str.split(' / ').explode().value_counts()

Output:

A B    2
E F    2
C D    1

Upvotes: 1

Related Questions