hmghaly

Reputation: 1502

Python regex suffix matching

For a typical set of word suffixes (ize, fy, ly, able, etc.), I want to know whether a given word ends with any of them and, if so, remove it. I know this can be done iteratively with word.endswith('ize'), for example, but I believe there is a neater regex way of doing it. I tried a positive lookahead with the end-of-string marker $, but for some reason it didn't work:

pat='(?=ate|ize|ify|able)$'
word='terrorize'
re.findall(pat,word)

Upvotes: 1

Views: 9650

Answers (5)

Ned Batchelder

Reputation: 375484

Little-known fact: endswith accepts a tuple of possibilities:

if word.endswith(('ate','ize','ify','able')):
    #...

Unfortunately, it doesn't indicate which string was found, so it doesn't help with removing the suffix.
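One possible workaround, sketched here on the assumption that a plain loop is acceptable: test each suffix in turn and slice off the first one that matches. The names SUFFIXES and strip_suffix are illustrative and not part of the answer above.

```python
# Illustrative suffix list taken from the question.
SUFFIXES = ('ate', 'ize', 'ify', 'able')

def strip_suffix(word, suffixes=SUFFIXES):
    """Return (stem, suffix); suffix is '' when no suffix matches."""
    for suffix in suffixes:
        if word.endswith(suffix):
            # Slice off exactly len(suffix) characters from the end.
            return word[:-len(suffix)], suffix
    return word, ''

print(strip_suffix('terrorize'))  # ('terror', 'ize')
print(strip_suffix('walk'))       # ('walk', '')
```

The first suffix in the tuple that matches wins, so the order matters if one suffix is itself the ending of another.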

Upvotes: 5

Hui Zheng

Reputation: 10224

You need to adjust the parentheses so that the $ sits inside the lookahead; just change pat from:

(?=ate|ize|ify|able)$

to:

(?=(ate|ize|ify|able)$)

If you need to remove the suffixes later, you could use the pattern:

^(.*)(?=(ate|ize|ify|able)$)

Test in REPL:

>>> import re
>>> pat = '^(.*)(?=(ate|ize|ify|able)$)'
>>> word = 'terrorize'
>>> re.findall(pat, word)
[('terror', 'ize')]
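As a rough sketch of how that pattern might be wrapped for reuse, the helper below returns the stem and the suffix, and falls back to the unchanged word when nothing matches. The names PAT and split_suffix are hypothetical, chosen only for this example.

```python
import re

# Same pattern as above: group 1 is the stem, group 2 the suffix seen by the lookahead.
PAT = re.compile(r'^(.*)(?=(ate|ize|ify|able)$)')

def split_suffix(word):
    """Return (stem, suffix), or (word, None) if no listed suffix ends the word."""
    m = PAT.match(word)
    if m is None:
        return word, None
    return m.group(1), m.group(2)

print(split_suffix('terrorize'))  # ('terror', 'ize')
print(split_suffix('walk'))       # ('walk', None)
```

Using match() rather than findall() makes the no-match case explicit, because it returns None instead of an empty list.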

Upvotes: 1

pochen

Reputation: 873

What you are looking for is actually a non-capturing group, (?:...), followed by $.
Check this out:

re.sub(r"(?:ate|ize|ify|able)$", "", "terrorize")

Have a look at this site: Regex.
It has tons of useful regex tips. Hope you enjoy it.

BTW, the Python library documentation itself is a neat and wonderful tutorial.
I use help() a lot :)

Upvotes: 2

kjetilh

Reputation: 4976

If you are matching one word at a time, simply remove the lookahead; the $ anchor is sufficient.
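A minimal sketch of that suggestion, assuming each input string is a single word: a plain capturing group anchored with $ reports which suffix matched, and the match's start offset gives the stem. SUFFIX_RE and find_suffix are illustrative names.

```python
import re

# No lookahead: the alternation is simply anchored at the end of the string.
SUFFIX_RE = re.compile(r'(ate|ize|ify|able)$')

def find_suffix(word):
    """Return (stem, suffix), or (word, None) when no suffix is found."""
    m = SUFFIX_RE.search(word)
    if m is None:
        return word, None
    # m.start() is the index where the suffix begins.
    return word[:m.start()], m.group(1)

print(find_suffix('simplify'))  # ('simpl', 'ify')
print(find_suffix('walk'))      # ('walk', None)
```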

Upvotes: 0

Martijn Pieters

Reputation: 1121256

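A lookahead is a zero-width assertion: like ^ and $, it anchors a match to a specific location but does not itself consume any text. Your pattern therefore asks for a position that is both followed by a suffix and at the end of the string, which can never be true.

You want to match these suffixes, but only at the end of a word, so use the word-boundary anchor \b instead: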

r'(ate|ize|ify|able)\b'

then use re.sub() to replace those:

re.sub(r'(ate|ize|ify|able)\b', '', word)

which works just fine:

>>> word='terrorize'
>>> re.sub(r'(ate|ize|ify|able)\b', '', word)
'terror'
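Because \b matches at the end of every word rather than only at the end of the string, the same pattern can also be applied to a whole phrase. The following is a small sketch under that assumption; the function name strip_all and the sample sentence are made up for illustration.

```python
import re

# \b anchors each suffix to the end of a word, not the end of the string.
WORD_SUFFIX_RE = re.compile(r'(ate|ize|ify|able)\b')

def strip_all(text):
    """Remove a listed suffix from every word in text that ends with one."""
    return WORD_SUFFIX_RE.sub('', text)

print(strip_all('terrorize and simplify the readable text'))
# terror and simpl the read text
```

Note that a word consisting only of a suffix, such as "able", would be removed entirely by this pattern.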

Upvotes: 2
