Leggerless
Leggerless

Reputation: 321

Finding occurrences of substrings within pandas dataframe -- Python

I have a list of 'words' I want to count below

word_list = ['one','two','three']

And I have a column within pandas dataframe with text below.

TEXT
-----
"Perhaps she'll be the one for me."
"Is it two or one?"
"Mayhaps it be three afterall..."
"Three times and it's a charm."
"One fish, two fish, red fish, blue fish."
"There's only one cat in the hat."
"One does not simply code into pandas."
"Two nights later..."
"Quoth the Raven... nevermore."

The desired output that I would like is the following below, where I want to count the number of times the substrings defined in word_list appear in the strings of each row in the dataframe.

Word | Count
one        5     
two        3     
three      2 

Is there a way to do this in Python 2.7?

Upvotes: 1

Views: 1371

Answers (3)

cs95
cs95

Reputation: 402263

Use str.extractall + value_counts:

df

                                         text
0         "Perhaps she'll be the one for me."
1                         "Is it two or one?"
2           "Mayhaps it be three afterall..."
3             "Three times and it's a charm."
4  "One fish, two fish, red fish, blue fish."
5          "There's only one cat in the hat."
6     "One does not simply code into pandas."
7                       "Two nights later..."
8             "Quoth the Raven... nevermore."

rgx = '({})'.format('|'.join(word_list))
df['text'].str.lower().str.extractall(rgx).iloc[:, 0].value_counts()

one      5
two      3
three    2
Name: 0, dtype: int64

Details

rgx
'(one|two|three)'

df.text.str.lower().str.extractall(rgx).iloc[:, 0]

   match
0  0          one
1  0          two
   1          one
2  0        three
3  0        three
4  0          one
   1          two
5  0          one
6  0          one
7  0          two
Name: 0, dtype: object

Performance

Small

# Zero's code 
%%timeit 
pd.Series({w: df.text.str.count(w, flags=re.IGNORECASE).sum() for w in word_list}).sort_values(ascending=False)
1000 loops, best of 3: 1.55 ms per loop
# Andy's code
%%timeit
long_string = "".join(df.iloc[:, 0]).lower()
for w in word_list:
     long_string.count(w)

10000 loops, best of 3: 132 µs per loop
%%timeit
df['text'].str.lower().str.extractall(rgx).iloc[:, 0].value_counts()
100 loops, best of 3: 2.53 ms per loop

Large

df = pd.concat([df] * 100000)
%%timeit 
pd.Series({w: df.text.str.count(w, flags=re.IGNORECASE).sum() for w in word_list}).sort_values(ascending=False)
1 loop, best of 3: 4.34 s per loop
%%timeit
long_string = "".join(df.iloc[:, 0]).lower()
for w in word_list:
    long_string.count(w)

10 loops, best of 3: 151 ms per loop
%%timeit 
df['text'].str.lower().str.extractall(rgx).iloc[:, 0].value_counts()
1 loop, best of 3: 4.12 s per loop

Upvotes: 1

Zero
Zero

Reputation: 76917

Use

In [52]: pd.Series({w: df.TEXT.str.contains(w, case=False).sum() for w in word_list})
Out[52]:
one      5
three    2
two      3
dtype: int64

Or, to count multiple instances in each row

In [53]: pd.Series({w: df.TEXT.str.count(w, flags=re.IGNORECASE).sum() for w in word_list})
Out[53]:
one      5
three    2
two      3
dtype: int64

Use sort_values

In [55]: s = pd.Series({w: df.TEXT.str.count(w, flags=re.IGNORECASE).sum() for w in word_list})

In [56]: s.sort_values(ascending=False)
Out[56]:
one      5
two      3
three    2
dtype: int64

Upvotes: 0

Andy Hayden
Andy Hayden

Reputation: 375377

I would do this with vanilla python, first join the string:

In [11]: long_string = "".join(df[0]).lower()

In [12]: long_string[:50]  # all the words glued up
Out[12]: "perhaps she'll be the one for me.is it two or one?"

In [13]: for w in word_list:
     ...:     print(w, long_string.count(w))
     ...:
one 5
two 3
three 2

If you want to return a Series, you could use a dict comprehension:

In [14]: pd.Series({w: long_string.count(w) for w in word_list})
Out[14]:
one      5
three    2
two      3
dtype: int64

Upvotes: 2

Related Questions