ccsv

Reputation: 8669

Python pandas count number of Regex matches in a string

I have a dataframe with sentences and a dictionary of terms grouped into topics, where I want to count the number of term matches for each topic.

import pandas as pd

terms = {'animals': ['fox', 'deer', 'eagle'],
         'people': ['John', 'Rob', 'Steve'],
         'games': ['basketball', 'football', 'hockey']}

df = pd.DataFrame({
    'Score': [4, 6, 2, 7, 8],
    'Foo': ['The quick brown fox was playing basketball today',
            'John and Rob visited the eagles nest, the foxes ran away',
            'Bill smells like a wet dog',
            'Steve threw the football at a deer. But the football missed',
            'Sheriff John does not like hockey']
})

So far I have created a column for each topic and marked it with 1 if any of that topic's words is present, by iterating over the dictionary.

df = pd.concat([df, pd.DataFrame(columns=list(terms.keys()))])


for k, v in terms.items():
    for val in v:
        df.loc[df.Foo.str.contains(val), k] = 1


print (df)

and I get:

                                                 Foo  Score animals games people
0   The quick brown fox was playing basketball today      4       1     1    NaN
1  John and Rob visited the eagles nest, the foxe...      6       1   NaN      1
2                         Bill smells like a wet dog      2     NaN   NaN    NaN
3  Steve threw the football at a deer. But the fo...      7       1     1      1
4                  Sheriff John does not like hockey      8     NaN     1      1

What is the best way to count the number of words from each topic that appear in each sentence? And is there a more efficient way of looping over the dictionary without using Cython?

Upvotes: 3

Views: 3664

Answers (2)

jezrael

Reputation: 863226

You can use split with stack, which is about 5 times faster than the Counter solution:

df1 = (df.Foo.str.split(expand=True)
             .stack()
             .reset_index(level=1, drop=True)
             .reset_index(name='Foo'))

for k, v in terms.items():
    df1[k] = df1.Foo.str.contains('|'.join(v))
# print(df1)

print(df1.groupby('index').sum().astype(int))
       games  animals  people
index                        
0          1        1       0
1          0        2       2
2          0        0       0
3          2        1       1
4          1        0       1

Timings:

In [233]: %timeit a(df)
100 loops, best of 3: 4.9 ms per loop

In [234]: %timeit b(df)
10 loops, best of 3: 25.2 ms per loop

Code:

def a(df):
    df1 = df.Foo.str.split(expand=True).stack().reset_index(level=1, drop=True).reset_index(name='Foo')
    for k, v in terms.items():
        df1[k] = df1.Foo.str.contains('|'.join(terms[k]))
    return df1.groupby('index').sum().astype(int)

def b(df):
    from collections import Counter

    df1 = pd.DataFrame(terms)

    res = []
    for i,r in df.iterrows():
        s = df1.replace(Counter(r['Foo'].split())).replace(r'\w', 0, regex=True).sum()
        res.append(pd.DataFrame(s).T)
    return pd.concat(res)
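
As a side note (a sketch, not one of the timed solutions above): if only the per-topic counts are needed, Series.str.count with a joined pattern may give them in one vectorized call per topic; re.escape is added defensively here in case a term ever contains regex metacharacters.

import re

# Sketch, assuming the same `terms` and `df` as in the question.
# str.count counts regex matches per sentence, so this reproduces the
# substring-style matching of str.contains ('eagles' counts for 'eagle').
counts = pd.DataFrame({k: df.Foo.str.count('|'.join(map(re.escape, v)))
                       for k, v in terms.items()})
print(counts)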

Upvotes: 2

Colonel Beauvel

Reputation: 31181

I would go for Counter and replace:

from collections import Counter

df1 = pd.DataFrame(terms)

res = []
for i, r in df.iterrows():
    # map exact word counts onto the term table, zero out remaining words, and sum per topic
    s = df1.replace(Counter(r['Foo'].split())).replace(r'\w', 0, regex=True).sum()
    res.append(pd.DataFrame(s).T)


In [109]: pd.concat(res)
Out[109]:
   animals  games  people
0        1      1       0
0        0      0       2
0        0      0       0
0        0      2       1
0        0      1       1
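
Note that this counts exact whitespace-split tokens, so 'deer.' in the fourth sentence does not match the term 'deer' (which is why animals is 0 there). A minimal sketch, assuming the same terms and df as above, that strips punctuation before counting; the helper topic_counts is just for illustration:

import string
from collections import Counter

def topic_counts(sentence):
    # drop punctuation so tokens like 'deer.' match the term 'deer'
    words = sentence.translate(str.maketrans('', '', string.punctuation)).split()
    c = Counter(words)
    return {k: sum(c[t] for t in v) for k, v in terms.items()}

print(pd.DataFrame([topic_counts(s) for s in df.Foo]))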

Upvotes: 1
