still_learning

Reputation: 806

Counting word frequency per date

I would like to count how many sentences contain specific words per date. For example:

Date    Sentences
2020-10-22 Word1 bla bla bla Word2
2020-10-22 Bla bla bla bla
2020-10-22 Word3 bla bla 
2020-10-22 Word1 bla bla bla
2020-10-23 Word3 bla 
2020-10-23 Word1 bla bla 
...

The words I want to search for are marked as Wordx (this is just an example; they are words like trump, money, and others), while bla bla bla stands for any other text (for example, will not win, ...). For each word (Word1, Word3, ...), I would like the number of times it was used in sentences on a given date. My approach is to create a list of the words I want to look for in my sentences and group by date, summing how many times each word was used over time. So something like this:

Mylist=['word1','word3']
df['Sentences'].str.contains('|', Mylist).groupby(['Date']).sum()

I am not getting the expected output, so I think I have written something wrong in my code, or maybe it is the wrong approach to this problem.

I would like to have something like this:

Word1       Freq
2020-10-22  2
2020-10-23  1

Word3      Freq
2020-10-22 1
2020-10-23 1

Upvotes: 0

Views: 85

Answers (1)

RichieV

Reputation: 5183

You are on the right track; it just needs a few minor syntax corrections:

  1. In order to group by Date after you've selected a single column, Date must be the index.
  2. series.str.contains takes the pattern to find as its first parameter, either a character sequence or a regular expression.

import pandas as pd

### sample data: always provide a runnable line of code
### or a clean table that can be parsed by `read_clipboard`
df = pd.DataFrame({
    'Date': ['10/22', '10/22', '10/22', '10/22', '10/23', '10/23'],
    'Sentences': [
        'Word1 bla bla bla Word2',
        'Bla bla bla bla',
        'Word3 bla bla',
        'Word1 bla bla bla',
        'Word3 bla',
        'Word1 bla bla'
    ]
}).set_index('Date')

Mylist = ['Word1', 'Word3']
out = { # dict comprehension
    ### actual solution
    word: df['Sentences'].str.contains(word).groupby('Date').sum()
    for word in Mylist
}

for key, val in out.items():
    print(key)
    print(val, '\n')
    

Output

Word1
Date
10/22    2
10/23    1
Name: Sentences, dtype: int64

Word3
Date
10/22    1
10/23    1
Name: Sentences, dtype: int64
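
If a single table is more convenient than a dict of Series, the per-word results can probably be combined into one DataFrame with a column per word. The following is only a sketch on the same sample data: the name `freq` is made up for illustration, and `regex=False` is added on the assumption that the search terms are plain strings rather than patterns.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['10/22', '10/22', '10/22', '10/22', '10/23', '10/23'],
    'Sentences': [
        'Word1 bla bla bla Word2',
        'Bla bla bla bla',
        'Word3 bla bla',
        'Word1 bla bla bla',
        'Word3 bla',
        'Word1 bla bla',
    ],
}).set_index('Date')

Mylist = ['Word1', 'Word3']

# One boolean column per word (True where the sentence contains it),
# then sum the True values within each date.
# `freq` is a hypothetical name; regex=False treats each word literally.
freq = pd.DataFrame({
    word: df['Sentences'].str.contains(word, regex=False)
    for word in Mylist
}).groupby('Date').sum()

print(freq)
```

Each row of `freq` is a date and each column a search word, so a single word's counts are available as `freq['Word1']`.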

Notice that this solution does not count multiple occurrences of a word in the same sentence. For example, "Word1 should return two Word1 counts" yields a single True from str.contains, so the groupby sum counts it only once.

If you wish to count multiple occurrences in the same sentence you can use series.str.findall as in:

df['Sentences'].str.findall(word).map(len).groupby('Date').sum()
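
To make the difference between the two counts concrete, here is a small sketch that compares them side by side. It uses `series.str.count`, which returns the number of matches per sentence directly, in place of `findall(...).map(len)`. The sample sentences, the `pattern` variable, and the word-boundary handling are assumptions added for illustration: `re.escape` guards against regex metacharacters in the search term, and `\b` keeps `Word1` from matching inside a longer token such as `Word10`.

```python
import re

import pandas as pd

# Made-up sample data for illustration only.
df = pd.DataFrame({
    'Date': ['10/22', '10/22', '10/23'],
    'Sentences': [
        'Word1 should return two Word1 counts',
        'Word10 is a different word',
        'Word1 bla bla',
    ],
}).set_index('Date')

word = 'Word1'
# Match the word only as a whole token; escape it in case it
# contains characters that are special in regular expressions.
pattern = r'\b' + re.escape(word) + r'\b'

# Number of sentences per date that contain the word at least once.
per_sentence = df['Sentences'].str.contains(pattern).groupby('Date').sum()

# Total number of occurrences per date, counting repeats in a sentence.
per_occurrence = df['Sentences'].str.count(pattern).groupby('Date').sum()

print(per_sentence)
print(per_occurrence)
```

Whether whole-word matching is wanted depends on the data; dropping the `\b` anchors gives back plain substring matching as in the answer above.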

Upvotes: 1
