Asha_Tir

Reputation: 73

How do I get regex to capture one-letter words and multiple letter words?

The following regex pattern does almost everything I need it to do, including catching contractions:

import re

re_pattern = "[a-zA-Z]+\\'?[a-zA-Z]+"

However, if I enter the following code:

sent = "I can't understand what I'm doing wrong or if I made a mistake."

re.findall(re_pattern, sent)

It doesn't pick up one-letter words, such as I or a:

["can't", 'understand', 'what', "I'm", 'doing', 'wrong', 'or', 'if', 'made', 'mistake']

Upvotes: 1

Views: 59

Answers (2)

Wiktor Stribiżew

Reputation: 626927

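You need to make the apostrophe and the letters after it optional as a single group, so that one letter can match on its own: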

re_pattern = r"[a-zA-Z]+(?:'[a-zA-Z]+)?"

A Python demo:

import re
re_pattern = r"[a-zA-Z]+(?:'[a-zA-Z]+)?"
sent = "I can't understand what I'm doing wrong or if I made a mistake."
print( re.findall(re_pattern, sent) )
# => ['I', "can't", 'understand', 'what', "I'm", 'doing', 'wrong', 'or', 'if', 'I', 'made', 'a', 'mistake']

Note: If you don't want to extract letter sequences that are glued to _ or digits (such as the abc in abc_1), use word boundaries:

re_pattern = r"\b[a-zA-Z]+(?:'[a-zA-Z]+)?\b"
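
To see what the word boundaries change, here is a small sketch that runs the pattern with and without \b on the same input. The sample string is made up for illustration and does not come from the question.

```python
import re

plain = r"[a-zA-Z]+(?:'[a-zA-Z]+)?"
bounded = r"\b[a-zA-Z]+(?:'[a-zA-Z]+)?\b"

# Hypothetical sample: tokens that mix letters with _ and digits
sample = "var_1 x2 don't"

plain_hits = re.findall(plain, sample)
bounded_hits = re.findall(bounded, sample)

print(plain_hits)    # => ['var', 'x', "don't"]
print(bounded_hits)  # => ["don't"]
```

Without \b the letters are cut out of var_1 and x2; with \b those tokens are skipped entirely because _ and digits count as word characters.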

And if you plan to match words written with any Unicode letters:

re_pattern = r"\b[^\W\d_]+(?:'[^\W\d_]+)?\b"
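
As a rough illustration of why [^\W\d_] helps with non-ASCII text, the sketch below compares it with the ASCII-only pattern. The French sample is an assumption added for the example, and it relies on Python 3, where str patterns are Unicode-aware by default.

```python
import re

ascii_only = r"[a-zA-Z]+(?:'[a-zA-Z]+)?"
unicode_letters = r"\b[^\W\d_]+(?:'[^\W\d_]+)?\b"

# Hypothetical sample "café l'été", written with escapes so each accented
# letter is a single precomposed code point
sample = "caf\u00e9 l'\u00e9t\u00e9"

ascii_hits = re.findall(ascii_only, sample)
unicode_hits = re.findall(unicode_letters, sample)

print(ascii_hits)    # => ['caf', 'l', 't']
print(unicode_hits)  # => ['café', "l'été"]
```

[^\W\d_] reads as "a word character that is neither a digit nor an underscore", which leaves only letters.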


Ah, and if you want to also match digits and underscores as part of "words", just use

re_pattern = r"\w+(?:'\w+)*"

The * after (?:'\w+) allows matching words like rock'n'roll.
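
A quick sketch of the difference between ? and * on the apostrophe group; the sample string is made up for illustration.

```python
import re

single = r"\w+(?:'\w+)?"    # at most one apostrophe part
repeated = r"\w+(?:'\w+)*"  # any number of apostrophe parts

# Hypothetical sample with two apostrophes in one word
sample = "rock'n'roll user_1"

single_hits = re.findall(single, sample)
repeated_hits = re.findall(repeated, sample)

print(single_hits)    # => ["rock'n", 'roll', 'user_1']
print(repeated_hits)  # => ["rock'n'roll", 'user_1']
```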

Upvotes: 2

Daraan

Reputation: 3790

Your pattern only matches words of at least two characters: the second + requires at least one letter after the optional '. Changing that + to * (zero or more) fixes it:

>>> re_pattern = "[a-zA-Z]+\\'?[a-zA-Z]*"
>>> re.findall(re_pattern, sent)
['I', "can't", 'understand', 'what', "I'm", 'doing', 'wrong', 'or', 'if', 'I', 'made', 'a', 'mistake']
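
One thing to be aware of, sketched below with a made-up sample: because the ' and the trailing letters are optional independently here, a word followed by an apostrophe keeps that apostrophe. The grouped pattern from the other answer is shown for comparison.

```python
import re

star_version = r"[a-zA-Z]+'?[a-zA-Z]*"
group_version = r"[a-zA-Z]+(?:'[a-zA-Z]+)?"

# Hypothetical sample with a trailing apostrophe
sample = "the dogs' bowl"

star_hits = re.findall(star_version, sample)
group_hits = re.findall(group_version, sample)

print(star_hits)   # => ['the', "dogs'", 'bowl']
print(group_hits)  # => ['the', 'dogs', 'bowl']
```

Whether the trailing apostrophe should be kept depends on the use case, for example possessives versus closing quote marks.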

Upvotes: 2
