Reputation: 321
I have a list of 'words' I want to count below
word_list = ['one','two','three']
And I have a column within pandas dataframe with text below.
TEXT
-----
"Perhaps she'll be the one for me."
"Is it two or one?"
"Mayhaps it be three afterall..."
"Three times and it's a charm."
"One fish, two fish, red fish, blue fish."
"There's only one cat in the hat."
"One does not simply code into pandas."
"Two nights later..."
"Quoth the Raven... nevermore."
The desired output that I would like is the following below, where I want to count the number of times the substrings defined in word_list appear in the strings of each row in the dataframe.
Word | Count
one 5
two 3
three 2
Is there a way to do this in Python 2.7?
Upvotes: 1
Views: 1371
Reputation: 402263
Use str.extractall
+ value_counts
:
df
text
0 "Perhaps she'll be the one for me."
1 "Is it two or one?"
2 "Mayhaps it be three afterall..."
3 "Three times and it's a charm."
4 "One fish, two fish, red fish, blue fish."
5 "There's only one cat in the hat."
6 "One does not simply code into pandas."
7 "Two nights later..."
8 "Quoth the Raven... nevermore."
rgx = '({})'.format('|'.join(word_list))
df['text'].str.lower().str.extractall(rgx).iloc[:, 0].value_counts()
one 5
two 3
three 2
Name: 0, dtype: int64
Details
rgx
'(one|two|three)'
df.text.str.lower().str.extractall(rgx).iloc[:, 0]
match
0 0 one
1 0 two
1 one
2 0 three
3 0 three
4 0 one
1 two
5 0 one
6 0 one
7 0 two
Name: 0, dtype: object
Performance
# Zero's code
%%timeit
pd.Series({w: df.text.str.count(w, flags=re.IGNORECASE).sum() for w in word_list}).sort_values(ascending=False)
1000 loops, best of 3: 1.55 ms per loop
# Andy's code
%%timeit
long_string = "".join(df.iloc[:, 0]).lower()
for w in word_list:
long_string.count(w)
10000 loops, best of 3: 132 µs per loop
%%timeit
df['text'].str.lower().str.extractall(rgx).iloc[:, 0].value_counts()
100 loops, best of 3: 2.53 ms per loop
df = pd.concat([df] * 100000)
%%timeit
pd.Series({w: df.text.str.count(w, flags=re.IGNORECASE).sum() for w in word_list}).sort_values(ascending=False)
1 loop, best of 3: 4.34 s per loop
%%timeit
long_string = "".join(df.iloc[:, 0]).lower()
for w in word_list:
long_string.count(w)
10 loops, best of 3: 151 ms per loop
%%timeit
df['text'].str.lower().str.extractall(rgx).iloc[:, 0].value_counts()
1 loop, best of 3: 4.12 s per loop
Upvotes: 1
Reputation: 76917
Use
In [52]: pd.Series({w: df.TEXT.str.contains(w, case=False).sum() for w in word_list})
Out[52]:
one 5
three 2
two 3
dtype: int64
Or, to count multiple instances in each row
In [53]: pd.Series({w: df.TEXT.str.count(w, flags=re.IGNORECASE).sum() for w in word_list})
Out[53]:
one 5
three 2
two 3
dtype: int64
Use sort_values
In [55]: s = pd.Series({w: df.TEXT.str.count(w, flags=re.IGNORECASE).sum() for w in word_list})
In [56]: s.sort_values(ascending=False)
Out[56]:
one 5
two 3
three 2
dtype: int64
Upvotes: 0
Reputation: 375377
I would do this with vanilla python, first join the string:
In [11]: long_string = "".join(df[0]).lower()
In [12]: long_string[:50] # all the words glued up
Out[12]: "perhaps she'll be the one for me.is it two or one?"
In [13]: for w in word_list:
...: print(w, long_string.count(w))
...:
one 5
two 3
three 2
If you want to return a Series, you could use a dict comprehension:
In [14]: pd.Series({w: long_string.count(w) for w in word_list})
Out[14]:
one 5
three 2
two 3
dtype: int64
Upvotes: 2