Reputation: 806
I would like to count how many sentences contain specific words per date. For example:
Date Sentences
2020-10-22 Word1 bla bla bla Word2
2020-10-22 Bla bla bla bla
2020-10-22 Word3 bla bla
2020-10-22 Word1 bla bla bla
2020-10-23 Word3 bla
2020-10-23 Word1 bla bla
...
The words I want to search for are shown as Wordx (these are just placeholders; the real ones are words like trump, money, and others), while bla bla bla stands for any other text (for example, will not win, ...). For each word (Word1, Word3, ...), I would like the number of times it was used in sentences on a given date. My approach is to create a list of the words I want to look for and use a groupby on the date, summing how many times each word was used over time. So something like this:
Mylist=['word1','word3']
df['Sentences'].str.contains('|', Mylist).groupby(['Date']).sum()
I am not getting the expected output, so I think I have written something wrong in my code, or maybe it is the wrong approach to this problem.
I would like to have something like this:
Word1 Freq
2020-10-22 2
2020-10-23 1
Word3 Freq
2020-10-22 1
2020-10-23 1
Upvotes: 0
Views: 85
Reputation: 5183
You are on the right track; it just needs a few minor syntax corrections:
To group by Date after you've selected a single column, Date must be the index.
str.contains takes the pattern to be found as its first parameter, be it a character sequence or a regular expression.

### sample data, always provide a callable line of code
### or a clean table that can be parsed by `read_clipboard`
import pandas as pd

df = pd.DataFrame({
    'Date': ['10/22', '10/22', '10/22', '10/22', '10/23', '10/23'],
    'Sentences': [
        'Word1 bla bla bla Word2',
        'Bla bla bla bla',
        'Word3 bla bla',
        'Word1 bla bla bla',
        'Word3 bla',
        'Word1 bla bla'
    ]
}).set_index('Date')  # Date must be the index so the Series can be grouped by it

Mylist = ['Word1', 'Word3']

out = {  # dict comprehension: one per-date count Series per word
    ### actual solution
    word: df['Sentences'].str.contains(word).groupby('Date').sum()
    for word in Mylist
}

for key, val in out.items():
    print(key)
    print(val, '\n')
Output
Word1
Date
10/22 2
10/23 1
Name: Sentences, dtype: int64
Word3
Date
10/22 1
10/23 1
Name: Sentences, dtype: int64
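If a single table is more convenient than one Series per word, the dictionary built above can be passed to the DataFrame constructor. This is only a sketch on the same sample data; the name `table` is made up for illustration and is not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['10/22', '10/22', '10/22', '10/22', '10/23', '10/23'],
    'Sentences': [
        'Word1 bla bla bla Word2',
        'Bla bla bla bla',
        'Word3 bla bla',
        'Word1 bla bla bla',
        'Word3 bla',
        'Word1 bla bla',
    ],
}).set_index('Date')

Mylist = ['Word1', 'Word3']

# One boolean Series per word, summed per date (True counts as 1).
out = {
    word: df['Sentences'].str.contains(word).groupby('Date').sum()
    for word in Mylist
}

# Hypothetical name `table`: one column per word, one row per date.
table = pd.DataFrame(out)
print(table)
```

Each column then holds the per-date count for one word, so a single value can be looked up with table.loc[date, word].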
Notice that this solution does not count multiple occurrences of a word in the same sentence. For a sentence such as "Word1 should return two Word1 counts", str.contains only returns True, and the groupby call counts it as one.
If you wish to count multiple occurrences in the same sentence, you can use Series.str.findall, as in:
df['Sentences'].str.findall(word).map(len).groupby('Date').sum()
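To make the difference concrete, here is a small self-contained comparison of the two approaches. The sample sentences are invented for illustration (the first one repeats Word1), and the variable names are assumptions rather than anything from the original post.

```python
import pandas as pd

# Hypothetical sample data: the first sentence contains Word1 twice.
df = pd.DataFrame({
    'Date': ['10/22', '10/22', '10/23'],
    'Sentences': [
        'Word1 bla Word1 bla',
        'Word3 bla bla',
        'Word1 bla',
    ],
}).set_index('Date')

word = 'Word1'

# str.contains: each sentence contributes at most 1, however often the word appears.
sentences_with_word = df['Sentences'].str.contains(word).groupby('Date').sum()

# str.findall: each sentence contributes the number of matches it contains.
total_occurrences = df['Sentences'].str.findall(word).map(len).groupby('Date').sum()

print(sentences_with_word)
print(total_occurrences)
```

On 10/22 the first approach counts one sentence while the second counts two occurrences; which one you want depends on whether "frequency" means sentences or individual mentions.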
Upvotes: 1