Reputation: 1
I'm trying to count the entries that contain certain words but do not contain certain other words. To be clear, I want the number of matches for which an eliminating condition is not met. Here's what I have:
import pandas as pd
import re
data = pd.read_csv('rando-file')
vague_series = pd.DataFrame([(data['text'].str.contains('bla1|bla2',
flags=re.IGNORECASE, regex = True))
&
(~data['text'].str.contains('blah3|bla4',
flags=re.IGNORECASE, regex = True))])
vague_count = vague_series.columns[0].sum()
print(vague_count)
Every attempt to count or sum has failed in this instance with an invalid syntax error. Removing the columns[0] bit simply gave me 0 and 1 values in place of True and False.
Upvotes: 0
Views: 115
Reputation: 78
Could you post a data sample to test with?
I tried it with a custom sample and it works well:
import pandas as pd
import re
sr = pd.Series(['New_York', 'Lisbon', 'Tokyo', 'Paris', 'Munich'])
idx = ['City 1', 'City 2', 'City 3', 'City 4', 'City 5']
sr.index = idx
result = (sr.str.contains(pat='i[a-z]', regex=True)) & (~sr.str.contains('s[a-z]', regex=True))
print(result.sum())
>>>2
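The DataFrame wrapper is probably the problem: passing the boolean Series inside a list gives a frame with a single row and one column per entry, and columns[0] is only the first column label, not the data.
Don't wrap it in a DataFrame; just sum the boolean Series directly: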
vague_series = (data['text'].str.contains('bla1|bla2', flags=re.IGNORECASE, regex=True) &
~data['text'].str.contains('blah3|bla4', flags=re.IGNORECASE, regex=True))
count = vague_series.sum()
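For a version you can run without the CSV, here is a minimal sketch of the same idea. The sample Series and the helper name count_matches are made up for illustration, since the original file isn't shown. It also passes na=False, on the assumption that the text column may contain missing values, which would otherwise show up as NaN in the mask.

```python
import re

import pandas as pd

# Hypothetical sample standing in for data['text']; the real CSV is not shown.
text = pd.Series([
    "bla1 appears here",
    "BLA2 together with bla4",
    "nothing relevant",
    None,
    "bla2 only",
])


def count_matches(series, include, exclude):
    """Count entries that match `include` but not `exclude`, ignoring case."""
    # na=False treats missing values as non-matches instead of NaN.
    has_wanted = series.str.contains(include, flags=re.IGNORECASE, regex=True, na=False)
    has_unwanted = series.str.contains(exclude, flags=re.IGNORECASE, regex=True, na=False)
    mask = has_wanted & ~has_unwanted
    # Summing a boolean Series counts the True values.
    return int(mask.sum())


vague_count = count_matches(text, "bla1|bla2", "blah3|bla4")
print(vague_count)

# Wrapping a Series in a list turns it into a single row,
# so the frame has one column per entry rather than one column of data.
mask = text.str.contains("bla1|bla2", flags=re.IGNORECASE, regex=True, na=False)
wrapped = pd.DataFrame([mask])
print(wrapped.shape)
```

The last two lines show why the DataFrame version in the question is awkward to sum: the frame is one row wide, so there is no single column holding the mask.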
Upvotes: 1