Reputation: 73
The following regex pattern does almost everything I need it to do, including catching contractions:
re_pattern = "[a-zA-Z]+\\'?[a-zA-Z]+"
However, if I enter the following code:
sent = "I can't understand what I'm doing wrong or if I made a mistake."
re.findall(re_pattern, sent)
It doesn't pick up one-letter words, such as I or a:
["can't", 'understand', 'what', "I'm", 'doing', 'wrong', 'or', 'if', 'made', 'mistake']
Upvotes: 1
Views: 59
Reputation: 626927
You need to use
re_pattern = r"[a-zA-Z]+(?:'[a-zA-Z]+)?"
A Python demo:
import re
re_pattern = r"[a-zA-Z]+(?:'[a-zA-Z]+)?"
sent = "I can't understand what I'm doing wrong or if I made a mistake."
print( re.findall(re_pattern, sent) )
# => ['I', "can't", 'understand', 'what', "I'm", 'doing', 'wrong', 'or', 'if', 'I', 'made', 'a', 'mistake']
Note: If you don't need to extract letter sequences glued to _ or digits, use word boundaries:
re_pattern = r"\b[a-zA-Z]+(?:'[a-zA-Z]+)?\b"
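To illustrate what the word boundaries change, here is a minimal sketch comparing the two patterns. The sample string is a hypothetical input, not taken from the question, chosen to contain digits and an underscore.

```python
import re

# Hypothetical sample text with letters glued to digits and to an underscore.
text = "abc123 can't foo_bar"

# Without word boundaries: letter runs are extracted even from inside
# tokens such as abc123 or foo_bar.
loose = re.findall(r"[a-zA-Z]+(?:'[a-zA-Z]+)?", text)

# With word boundaries: a match must start and end at a \b position, so
# letter runs touching digits or underscores are skipped entirely.
strict = re.findall(r"\b[a-zA-Z]+(?:'[a-zA-Z]+)?\b", text)

print(loose)   # ['abc', "can't", 'foo', 'bar']
print(strict)  # ["can't"]
```

Since \b treats digits and _ as word characters, abc123 and foo_bar contain no position where the pattern can both start and end on a boundary.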
And if you plan to match words made of any Unicode letters:
re_pattern = r"\b[^\W\d_]+(?:'[^\W\d_]+)?\b"
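As a rough illustration of the Unicode-aware pattern, the sketch below runs it next to the ASCII-only one. It assumes Python 3, where \W, \d and _ are Unicode-aware for str patterns by default; the French sample sentence is a hypothetical input.

```python
import re

# Hypothetical sample text containing accented letters.
text = "Ça c'est l'été"

# [^\W\d_] means "a word character that is neither a digit nor an
# underscore", which amounts to any Unicode letter.
unicode_words = re.findall(r"\b[^\W\d_]+(?:'[^\W\d_]+)?\b", text)

# The ASCII-only class splits words at every accented letter.
ascii_words = re.findall(r"[a-zA-Z]+(?:'[a-zA-Z]+)?", text)

print(unicode_words)  # ['Ça', "c'est", "l'été"]
print(ascii_words)    # ['a', "c'est", 'l', 't']
```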
Ah, and if you want to also match digits and underscores as part of "words", just use
re_pattern = r"\w+(?:'\w+)*"
The * after (?:'\w+) allows matching words like rock'n'roll.
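A short sketch of the difference between ? and * on the optional group, using a hypothetical sample string:

```python
import re

# Hypothetical sample text: a word with two apostrophes, a word with a
# digit and an underscore, and an ordinary contraction.
text = "rock'n'roll user_1 isn't"

# With ?, at most one apostrophe-plus-word part is allowed per match.
one_part = re.findall(r"\w+(?:'\w+)?", text)

# With *, any number of apostrophe-plus-word parts may follow.
many_parts = re.findall(r"\w+(?:'\w+)*", text)

print(one_part)    # ["rock'n", 'roll', 'user_1', "isn't"]
print(many_parts)  # ["rock'n'roll", 'user_1', "isn't"]
```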
Upvotes: 2
Reputation: 3790
Your pattern only matches words of at least two characters: the second + also requires at least one letter, and the ' in between is optional.
Changing the second + to * (zero or more) will do it:
>>> re_pattern = "[a-zA-Z]+\\'?[a-zA-Z]*"
>>> re.findall(re_pattern, sent)
['I', "can't", 'understand', 'what', "I'm", 'doing', 'wrong', 'or', 'if', 'I', 'made', 'a', 'mistake']
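One behavior to be aware of with this variant: because the apostrophe and the trailing letters are optional independently of each other, a match can end in a bare apostrophe. The sketch below compares it with a pattern that groups the apostrophe together with the letters after it. The sample string is a hypothetical input, and the apostrophe is written without the backslash, which matches the same character.

```python
import re

# Hypothetical sample text with a trailing (possessive) apostrophe.
text = "the dogs' bowl"

# Apostrophe and following letters are optional independently.
ungrouped = re.findall(r"[a-zA-Z]+'?[a-zA-Z]*", text)

# Apostrophe is only matched when letters follow it.
grouped = re.findall(r"[a-zA-Z]+(?:'[a-zA-Z]+)?", text)

print(ungrouped)  # ['the', "dogs'", 'bowl']
print(grouped)    # ['the', 'dogs', 'bowl']
```

Whether the trailing apostrophe should be kept depends on the use case; for possessives it may be wanted, while for text containing single-quoted words it probably is not.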
Upvotes: 2