Reputation: 239
I would like to find words of length >= 1 which may contain a ' or a - within. Here is a test string:
a quake-prone area- (aujourd'hui-
In Python, I'm currently using this regex:
import re

string = "a quake-prone area- (aujourd'hui-"
RE_WORDS = re.compile(r'[a-z]+[-\']?[a-z]+')
words = RE_WORDS.findall(string)
I would like to get this result:
>>> words
[u'a', u'quake-prone', u'area', u"aujourd'hui"]
but I get this instead:
>>> words
[u'quake-prone', u'area', u"aujourd'hui"]
Unfortunately, because of the last + quantifier, it skips all words of length 1. If I use the * quantifier instead, it will find a but also area- instead of area.
So how could I create a conditional regex that says: if the word contains an apostrophe or a hyphen, use the + quantifier, otherwise use the * quantifier?
Upvotes: 0
Views: 118
Reputation: 174816
I suggest making the trailing [-\']?[a-z]+ part optional by putting it into a non-capturing group and adding a ? quantifier after that group.
>>> import re
>>> string = "a quake-prone area- (aujourd'hui-"
>>> RE_WORDS = re.compile(r'[a-z]+(?:[-\'][a-z]+)?')
>>> RE_WORDS.findall(string)
['a', 'quake-prone', 'area', "aujourd'hui"]
The reason a is not matched is that your regex contains two [a-z]+ parts, which together require at least two lowercase letters in every match.
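To make the difference concrete, here is a small side-by-side sketch of the three patterns discussed in this thread, run on the question's test string. The variable names (text, original, star, grouped) are only illustrative, and the star variant is my reading of what "use the * quantifier" means in the question.

```python
import re

text = "a quake-prone area- (aujourd'hui-"

# Pattern from the question: two mandatory [a-z]+ runs, so every match
# needs at least two letters and the single-letter word is skipped.
original = re.compile(r"[a-z]+[-']?[a-z]+")

# Assumed "*" variant from the question: single letters now match,
# but a trailing separator is swallowed as well.
star = re.compile(r"[a-z]+[-']?[a-z]*")

# Suggested pattern: the separator and the letters after it form one
# optional group, so a separator only matches when letters follow it.
grouped = re.compile(r"[a-z]+(?:[-'][a-z]+)?")

original_words = original.findall(text)
star_words = star.findall(text)
grouped_words = grouped.findall(text)

print(original_words)  # ['quake-prone', 'area', "aujourd'hui"]
print(star_words)      # ['a', 'quake-prone', 'area-', "aujourd'hui"]
print(grouped_words)   # ['a', 'quake-prone', 'area', "aujourd'hui"]
```

Only the grouped pattern gives both the single-letter word and area without its trailing hyphen.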
Note that the regex I suggested won't match area- because the optional group (?:[-\'][a-z]+)? requires at least one lowercase letter immediately after the - symbol. When no letter follows the hyphen, the group matches nothing and the match ends just before the hyphen. That is why you get area in the output instead of area-: no lowercase letter follows the -.
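One way to check where each match stops is to look at the character that immediately follows it. The sketch below uses re.finditer for that. The names text and next_chars are only illustrative.

```python
import re

RE_WORDS = re.compile(r"[a-z]+(?:[-'][a-z]+)?")
text = "a quake-prone area- (aujourd'hui-"

# Pair each matched word with the single character right after it
# (an empty string if the match ends at the end of the text).
next_chars = [
    (m.group(), text[m.end():m.end() + 1])
    for m in RE_WORDS.finditer(text)
]

for word, following in next_chars:
    print(repr(word), "is followed by", repr(following))
```

For area and aujourd'hui the following character is the hyphen itself, which shows that the match ended just before it rather than consuming it.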
Upvotes: 1