Reputation: 61
I want to count the number of occurrences of the word 'people' in a text using Python. For that I use Counter and Python's regular expressions:
for j in range(len(paragraphs)):
    text = paragraphs[j].text
    count[j] = Counter(re.findall(r'\bpeople\b', text))
Yet my code does not take into account the occurrences of 'people.', 'people!' and 'people?'. How can I modify it to also count the cases where the word is followed by a specific character?
Thank you for your help.
Upvotes: 1
Views: 74
Reputation: 51663
You can use an optional character-group in your regex:
r'\bpeople\b[.,!?]?'
The ? specifies that it can occur 0 or 1 times, and the [] specifies which characters are allowed. There is no need to escape the . (or, for example, ()*+?) inside [], although they have a special meaning in a regex. If you wanted to use a - inside [], you would need to escape it (or place it first or last), because it is used to denote ranges in sets: [1-5] matches any one of 1, 2, 3, 4, 5.
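As a rough illustration of the optional character set, here is a minimal sketch. The sample text is made up, and the pattern puts \b before the set, on the assumption that the punctuation mark should be part of the match (a \b placed after the punctuation would only succeed when a word character follows it).

```python
import re

# Made-up sample text, not from the question.
text = "Some people like people. Other people! Which people? Not peoples, people"

# [.,!?]? matches zero or one of the listed characters; none needs escaping inside [].
pattern = re.compile(r'\bpeople\b[.,!?]?')
matches = pattern.findall(text)
print(matches)       # each hit, with its trailing punctuation if present
print(len(matches))  # total number of occurrences

# A - between two characters denotes a range; escaped, it is a literal hyphen.
digits_in_range = re.findall(r'[1-5]', '0123456789')
literal_hyphen = re.findall(r'[1\-5]', '0123456789-')
```

Note that 'peoples' is skipped because \b requires the word to end right after 'people'.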
See: https://docs.python.org/3/library/re.html#regular-expression-syntax
[] Used to indicate a set of characters. In a set:
Characters can be listed individually, e.g. [amk] will match 'a', 'm', or 'k'. Ranges of characters can be indicated by giving two characters and separating them by a '-', for example [a-z] will match any lowercase ASCII letter, [0-5][0-9] will match all the two-digits numbers from 00 to 59, and [0-9A-Fa-f] will match any hexadecimal digit. [...]
Upvotes: 2
Reputation: 45261
Does it have to use regex? Why not just:
len(text.split("people"))-1
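One thing to keep in mind with this approach: it counts substrings, not whole words. A small sketch with a made-up sentence, comparing it to str.count and to the word-boundary regex from the question:

```python
import re

# Made-up sample text; note 'peoples' and the capitalised 'People'.
text = "People say peoples differ, but people agree: people!"

# Splitting on the substring gives one more piece than there are occurrences.
substring_count = len(text.split("people")) - 1
# str.count returns the same number more directly.
direct_count = text.count("people")
# The regex with \b only counts 'people' as a whole word.
word_count = len(re.findall(r'\bpeople\b', text))

print(substring_count, direct_count, word_count)
```

Here the substring count includes the 'people' inside 'peoples', while the regex does not. All three are case-sensitive, so 'People' is not counted by any of them.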
Upvotes: 0
Reputation: 316
You can add an optional character class after the 'people' part of your regex pattern. Try the following:
for j in range(len(paragraphs)):
    text = paragraphs[j].text
    count[j] = Counter(re.findall(r'\bpeople\b[.?!]?', text))
The ? is the zero-or-one quantifier. The above pattern seems to work on regex101.com, but I haven't tried it out in a Python shell yet.
Upvotes: 1
Reputation: 1117
people[?.!]
This will only match 'people?', 'people.' and/or 'people!'.
So if you add a few more Counter(re.findall(...)) calls, you will be able to do something like this:
# Use whichever of these fits the form you want to count.
# This will only match people followed by whitespace
count[j] = Counter(re.findall(r'people\s', text))
# This will only match people?
count[j] = Counter(re.findall(r'people\?', text))
# This will only match people.
count[j] = Counter(re.findall(r'people\.', text))
# This will only match people!
count[j] = Counter(re.findall(r'people\!', text))
You need to use \ to escape the special characters ? and . (the ! has no special meaning, so escaping it is optional).
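If you want the separate counts in one pass instead of several findall calls, something like the following might work. It is only a sketch: paragraph_texts is a hypothetical list of plain strings standing in for paragraphs[j].text, and the pattern assumes whole-word matches are wanted.

```python
import re
from collections import Counter

# Hypothetical stand-in for the text of each paragraph.
paragraph_texts = [
    "people people. people!",
    "Which people? Those people.",
]

# Whole word 'people', optionally followed by one of . ? !
pattern = re.compile(r'\bpeople\b[.?!]?')

# One Counter per paragraph, keyed by the exact form that was matched.
counts = [Counter(pattern.findall(t)) for t in paragraph_texts]

# Total number of occurrences across all paragraphs and forms.
total = sum(sum(c.values()) for c in counts)

for j, c in enumerate(counts):
    print(j, dict(c))
print(total)
```

Each Counter then tells you how often 'people', 'people.', 'people!' and 'people?' appeared in that paragraph, and summing the values gives the overall count.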
Also, this is a good resource when you are experimenting with Python regular expressions: https://pythex.org/ The site also has a regular expression cheat sheet.
Upvotes: 1