Reputation: 859
I have a dataframe df with a column containing text, df['text'] (articles from a newspaper, in this case). How can I get a count of the rows in df['text'] that have a word count above some threshold of n words?
An example of df
is shown below. Each article can contain an arbitrary number of words.
print(df['text'].head(10))
0 Emerging evidence that Mexico economy was back...
1 Chrysler Corp Tuesday announced million in new...
2 CompuServe Corp Tuesday reported surprisingly ...
3 CompuServe Corp Tuesday reported surprisingly ...
4 If dining at Planet Hollywood made you feel li...
5 Hog prices fell Tuesday after government slaug...
6 Blue chip stocks rallied Tuesday after the Fed...
7 Sprint Corp Tuesday announced plans to offer I...
8 Shoppers are loading up this year on perennial...
9 Kansas and Arizona filed lawsuits against some...
Name: text, dtype: object
My goal with this data is to find a count of the articles that contain greater than n words. See the pseudocode below for an example.
n = 250  # word-count cutoff for counting
counter = 0
for row in df['text']:
    if len(row.split()) >= n:  # count the words in this one article
        counter += 1
print(counter)
The desired output is the number of articles containing more than n words (in this case, n is arbitrarily set to 250). In the pseudocode above, wordcount is some function that counts the words in one row (here, a single article). So for a row x, if N (the number of words in the article) is 340, it is greater than the threshold n of 250, the if statement is triggered, and counter increases by one.
Ideally, I would like to do this in a vectorized way, as the dataframe is large. If not, apply
works just fine.
Upvotes: 1
Views: 73
Reputation: 35626
Assuming that "words" are separated by single spaces, one approach is to count the number of spaces in each row and add 1, then compare the result to the n value.
import pandas as pd
df = pd.DataFrame({
'text': {0: 'Emerging evidence that Mexico economy was back',
1: 'Chrysler Corp Tuesday announced million in new',
2: 'CompuServe Corp Tuesday reported surprisingly',
3: 'CompuServe Corp Tuesday reported surprisingly',
4: 'If dining at Planet Hollywood made you feel li',
5: 'Hog prices fell Tuesday after government slaug',
6: 'Blue chip stocks rallied Tuesday after the Fed',
7: 'Sprint Corp Tuesday announced plans to offer I',
8: 'Shoppers are loading up this year on perennial',
9: 'Kansas and Arizona filed lawsuits against s'}
})
n = 8
# Words are 1 more than the number of spaces
# Compare greater than equal to n
m = df['text'].str.count(' ').add(1).ge(n)
filtered_df = df[m]
print(filtered_df)
df:
text
0 Emerging evidence that Mexico economy was back # 7
1 Chrysler Corp Tuesday announced million in new # 7
2 CompuServe Corp Tuesday reported surprisingly # 5
3 CompuServe Corp Tuesday reported surprisingly # 5
4 If dining at Planet Hollywood made you feel li # 9
5 Hog prices fell Tuesday after government slaug # 7
6 Blue chip stocks rallied Tuesday after the Fed # 8
7 Sprint Corp Tuesday announced plans to offer I # 8
8 Shoppers are loading up this year on perennial # 8
9 Kansas and Arizona filed lawsuits against s # 7
filtered_df:
text
4 If dining at Planet Hollywood made you feel li # 9
6 Blue chip stocks rallied Tuesday after the Fed # 8
7 Sprint Corp Tuesday announced plans to offer I # 8
8 Shoppers are loading up this year on perennial # 8
If just the number of matching rows is needed, use sum on the mask. True values count as 1 and False as 0, so the filtered DataFrame doesn't have to be built at all to get the count:
m = df['text'].str.count(' ').add(1).ge(n)
print(m.sum())
Output:
4
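One caveat, assuming real article text rather than the cleaned sample above: consecutive spaces inflate the space-counting approach, whereas splitting on whitespace does not, because `str.split()` with no pattern splits on runs of whitespace. A minimal sketch with a made-up two-row Series:

```python
import pandas as pd

# Hypothetical sample: the first entry contains a double space
s = pd.Series(["one  two", "a b c"])

space_based = s.str.count(' ').add(1)   # counts every single space character
split_based = s.str.split().str.len()   # splits on runs of whitespace

print(space_based.tolist())  # [3, 3]
print(split_based.tolist())  # [2, 3]
```

For text with consistent single spacing the two give identical counts, and the space-counting version avoids materializing the intermediate lists.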
Upvotes: 1
Reputation: 5918
Input Sample
d="""text
Emerging evidence that Mexico economy was back...
Chrysler Corp Tuesday announced million in new...
CompuServe Corp Tuesday reported surprisingly ...
CompuServe Corp Tuesday reported surprisingly ...
If dining at Planet Hollywood made you feel li...
Hog prices fell Tuesday after government slaug...
Blue chip stocks rallied Tuesday after the Fed...
Sprint Corp Tuesday announced plans to offer I...
Shoppers are loading up this year on perennial...
Kansas and Arizona filed lawsuits against some..."""
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(d))
df
n=7 # replace it with 250
df[df['text'].str.split().str.len() > n].count()  # per-column non-null count of the filtered frame
Output
text 4
dtype: int64
n=7 # replace it with 250
df[df['text'].str.split().str.len() > n]
Output
text
4 If dining at Planet Hollywood made you feel li...
6 Blue chip stocks rallied Tuesday after the Fed...
7 Sprint Corp Tuesday announced plans to offer I...
8 Shoppers are loading up this year on perennial...
df['len'] = df['text'].str.split().str.len()
df
Output
text len
0 Emerging evidence that Mexico economy was back... 7
1 Chrysler Corp Tuesday announced million in new... 7
2 CompuServe Corp Tuesday reported surprisingly ... 6
3 CompuServe Corp Tuesday reported surprisingly ... 6
4 If dining at Planet Hollywood made you feel li... 9
5 Hog prices fell Tuesday after government slaug... 7
6 Blue chip stocks rallied Tuesday after the Fed... 8
7 Sprint Corp Tuesday announced plans to offer I... 8
8 Shoppers are loading up this year on perennial... 8
9 Kansas and Arizona filed lawsuits against some... 7
Upvotes: 1
Reputation: 18416
If you just want to filter on word count, split the texts on whitespace and compare the length of the resulting list.
>>> df[df['text'].str.split().apply(len)>=8]
text
4 If dining at Planet Hollywood made you feel li
6 Blue chip stocks rallied Tuesday after the Fed
7 Sprint Corp Tuesday announced plans to offer I
8 Shoppers are loading up this year on perennial
If you want to filter on unique word count, you may want to convert the resulting list to a set after split:
>>> df[df['text'].str.split().apply(set).apply(len)>=8]
text
4 If dining at Planet Hollywood made you feel li
6 Blue chip stocks rallied Tuesday after the Fed
7 Sprint Corp Tuesday announced plans to offer I
8 Shoppers are loading up this year on perennial
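A tiny illustration of the difference between total and unique word counts, using a made-up two-row frame (the repeated word "the" is counted once by the set-based version):

```python
import pandas as pd

df = pd.DataFrame({'text': ["the cat sat on the mat", "all words here differ"]})

total = df['text'].str.split().apply(len)              # total words per row
unique = df['text'].str.split().apply(set).apply(len)  # distinct words per row

print(total.tolist())   # [6, 4]
print(unique.tolist())  # [5, 4]
```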
Upvotes: 0