DrakeMurdoch

Reputation: 859

How to count the number of rows of string data containing more than a certain number of words

I have a dataframe df that has a column containing text df['text'] (articles from a newspaper, in this case). How can I get a count of the rows in df['text'] that have a word count above some threshold of n words?

An example of df is shown below. Each article can contain an arbitrary number of words.

print(df['text'].head(10))

0    Emerging evidence that Mexico economy was back...
1    Chrysler Corp Tuesday announced million in new...
2    CompuServe Corp Tuesday reported surprisingly ...
3    CompuServe Corp Tuesday reported surprisingly ...
4    If dining at Planet Hollywood made you feel li...
5    Hog prices fell Tuesday after government slaug...
6    Blue chip stocks rallied Tuesday after the Fed...
7    Sprint Corp Tuesday announced plans to offer I...
8    Shoppers are loading up this year on perennial...
9    Kansas and Arizona filed lawsuits against some...
Name: text, dtype: object

My goal with this data is to count the articles that contain more than n words. See the pseudocode below for an example.

n = 250  # word-count cutoff
counter = 0

def wordcount(text):
    # placeholder helper: counts the whitespace-separated words in one row's string
    return len(text.split())

for row in df['text']:
    if wordcount(row) >= n:  # test the current row, not the whole column
        counter += 1

print(counter)

The desired output is the number of articles containing more than n words (here, n is arbitrarily set to 250). In the pseudocode above, wordcount stands for a function that counts the words in one row (that is, a single article). For example, if an article has 340 words, that exceeds the threshold of 250, so the if statement is triggered and counter increases by one.

Ideally, I would like to do this in a vectorized way, as the dataframe is large. If not, apply works just fine.

Upvotes: 1

Views: 73

Answers (3)

Henry Ecker

Reputation: 35626

Assuming that "words" are separated by single spaces, one approach is to count the number of spaces in each string and add 1, then compare the result to n.

import pandas as pd

df = pd.DataFrame({
    'text': {0: 'Emerging evidence that Mexico economy was back',
             1: 'Chrysler Corp Tuesday announced million in new',
             2: 'CompuServe Corp Tuesday reported surprisingly',
             3: 'CompuServe Corp Tuesday reported surprisingly',
             4: 'If dining at Planet Hollywood made you feel li',
             5: 'Hog prices fell Tuesday after government slaug',
             6: 'Blue chip stocks rallied Tuesday after the Fed',
             7: 'Sprint Corp Tuesday announced plans to offer I',
             8: 'Shoppers are loading up this year on perennial',
             9: 'Kansas and Arizona filed lawsuits against s'}
})

n = 8

# The word count is one more than the number of spaces
# Keep rows whose word count is greater than or equal to n
m = df['text'].str.count(' ').add(1).ge(n)

filtered_df = df[m]
print(filtered_df)

df:

                                             text
0  Emerging evidence that Mexico economy was back  # 7
1  Chrysler Corp Tuesday announced million in new  # 7
2   CompuServe Corp Tuesday reported surprisingly  # 5
3   CompuServe Corp Tuesday reported surprisingly  # 5
4  If dining at Planet Hollywood made you feel li  # 9
5  Hog prices fell Tuesday after government slaug  # 7
6  Blue chip stocks rallied Tuesday after the Fed  # 8
7  Sprint Corp Tuesday announced plans to offer I  # 8
8  Shoppers are loading up this year on perennial  # 8
9     Kansas and Arizona filed lawsuits against s  # 7

filtered:

                                             text
4  If dining at Planet Hollywood made you feel li  # 9
6  Blue chip stocks rallied Tuesday after the Fed  # 8
7  Sprint Corp Tuesday announced plans to offer I  # 8
8  Shoppers are loading up this year on perennial  # 8

If only the number of matching rows is needed, use sum on the mask. True values count as 1 and False values as 0, so the filtered DataFrame doesn't have to be built at all to get the count:

m = df['text'].str.count(' ').add(1).ge(n)

print(m.sum())

Output:

4
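
One caveat worth checking on real article text: counting spaces assumes exactly one space between words and no leading or trailing whitespace. Below is a minimal sketch, using made-up strings rather than the question's data, that compares the space count with counting runs of non-whitespace characters via the regex `\S+`. The names `by_spaces` and `by_tokens` are only illustrative.

```python
import pandas as pd

# Hypothetical strings, chosen to include irregular whitespace
s = pd.Series([
    'one two three',       # single spaces
    'one  two   three',    # repeated spaces
    ' one two three ',     # leading and trailing spaces
    'one\ttwo\nthree',     # tab and newline as separators
])

# Space counting: only matches the word count for single-space-separated text
by_spaces = s.str.count(' ').add(1)

# Counting runs of non-whitespace characters handles all four cases
by_tokens = s.str.count(r'\S+')

print(pd.DataFrame({'by_spaces': by_spaces, 'by_tokens': by_tokens}))
```

If the text is already normalized to single spaces, both give the same counts, so this only matters for messier input.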

Upvotes: 1

Utsav

Reputation: 5918

Input Sample

from io import StringIO

import pandas as pd

d = """text
Emerging evidence that Mexico economy was back...
Chrysler Corp Tuesday announced million in new...
CompuServe Corp Tuesday reported surprisingly ...
CompuServe Corp Tuesday reported surprisingly ...
If dining at Planet Hollywood made you feel li...
Hog prices fell Tuesday after government slaug...
Blue chip stocks rallied Tuesday after the Fed...
Sprint Corp Tuesday announced plans to offer I...
Shoppers are loading up this year on perennial...
Kansas and Arizona filed lawsuits against some..."""
df=pd.read_csv(StringIO(d))
df

If we want only the count of rows that have more than n words

n=7 # replace it with 250
df[df['text'].str.split().str.len() > n].count()

Output

text    4
dtype: int64

If we want the rows whose word count is greater than n

n=7 # replace it with 250
df[df['text'].str.split().str.len() > n]

Output

    text
4   If dining at Planet Hollywood made you feel li...
6   Blue chip stocks rallied Tuesday after the Fed...
7   Sprint Corp Tuesday announced plans to offer I...
8   Shoppers are loading up this year on perennial...

If we want the word count for each row

df['len'] = df['text'].str.split().str.len()
df

Output

    text                                               len
0   Emerging evidence that Mexico economy was back...   7
1   Chrysler Corp Tuesday announced million in new...   7
2   CompuServe Corp Tuesday reported surprisingly ...   6
3   CompuServe Corp Tuesday reported surprisingly ...   6
4   If dining at Planet Hollywood made you feel li...   9
5   Hog prices fell Tuesday after government slaug...   7
6   Blue chip stocks rallied Tuesday after the Fed...   8
7   Sprint Corp Tuesday announced plans to offer I...   8
8   Shoppers are loading up this year on perennial...   8
9   Kansas and Arizona filed lawsuits against some...   7
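
A real article column may contain missing values, which the sample above does not. As a rough sketch on a made-up frame (the data and the names `lengths`, `mask` and `count` are assumptions for illustration), the split-and-length approach leaves NaN for missing rows, and those rows then fail the comparison and are not counted. Summing the mask also gives a plain integer instead of the one-row Series that `.count()` returns.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing articles
df = pd.DataFrame({'text': ['a b c d', None, 'a b', np.nan, 'a b c d e']})

n = 3

# Word count per row; missing rows come back as NaN
lengths = df['text'].str.split().str.len()

# NaN > n evaluates to False, so missing rows are excluded
mask = lengths > n

# Sum the boolean mask to get a scalar count
count = int(mask.sum())
print(count)
```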

Upvotes: 1

ThePyGuy

Reputation: 18416

If you just want to filter on word count, split each text on whitespace and compare the length of the resulting list to the threshold.

>>> df[df['text'].str.split().apply(len)>=8]
                                             text
4  If dining at Planet Hollywood made you feel li
6  Blue chip stocks rallied Tuesday after the Fed
7  Sprint Corp Tuesday announced plans to offer I
8  Shoppers are loading up this year on perennial

If you want to filter on unique word count, convert the resulting list to a set after the split:

>>> df[df['text'].str.split().apply(set).apply(len)>=8]
                                             text
4  If dining at Planet Hollywood made you feel li
6  Blue chip stocks rallied Tuesday after the Fed
7  Sprint Corp Tuesday announced plans to offer I
8  Shoppers are loading up this year on perennial
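
The two filters return the same rows here only because no word repeats within a sample row. A small sketch with made-up strings (not taken from the question) shows where the total and unique counts diverge. It also shows that a set treats 'The' and 'the' as different words unless the text is lowercased first, which is an extra step assumed here and not part of the answer above.

```python
import pandas as pd

# Hypothetical strings with repeated words and mixed case
s = pd.Series([
    'the cat and the dog and the bird',
    'one two three four five six',
    'The the THE cat',
])

# Total words per row
total = s.str.split().str.len()

# Unique words per row (case-sensitive)
unique = s.str.split().apply(lambda words: len(set(words)))

# Unique words per row after lowercasing (case-insensitive)
unique_ci = s.str.lower().str.split().apply(lambda words: len(set(words)))

print(pd.DataFrame({'total': total, 'unique': unique, 'unique_ci': unique_ci}))
```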

Upvotes: 0
