Reputation: 13700

Regex search multiple suffixes

I have a big list of target words I am searching

words = ['Word1', 'Word2', 'Word3']

I've been told that a regular expression of this sort:

suffix = re.compile('(?:{words}) (\\w+)'.format(words='|'.join(words)))

Is pretty efficient, since it fails the regex evaluation immediately when a character that doesn't match the expression is met.

However, the other way around is not efficient:

prefix = re.compile('(\\w+) (?:{words})'.format(words='|'.join(words)))

Is there an elegant way to instruct python's regex to do the search in reverse ?

Edit

I've been asked to add example usages:

# prefix search
title = re.compile('(?:Mr.|Mrs.|Ms.|Dr. |Lt.) (\\w+)')
# suffix search
company = re.compile('(\\w+) (?:Inc.| LLP.|ltd.|GMBH)')
# invoking the regex
all_people_names  = title.findall(document)
all_company_names = company.findall(document)

Edit 2

A lot of people had been skeptical regarding the significance of the timing differences.

I've implemented the 2 methods: endswith() and endswith_rev() that reverses the string and the results as kabanus suggested.

These are the results:

As you can see, it makes a huge difference, even with a small amount of suffixes.

Upvotes: 4

Answers (3)

geosmart

Reputation: 666

try .*\.(?:jpg|gif|png)

will match

1.jpg
b.png
c.gif

test it in https://regex101.com/

Non-capturing group (?:jpg|gif|png)
1st Alternative jpg
jpg matches the characters jpg literally (case sensitive)
2nd Alternative gif
gif matches the characters gif literally (case sensitive)
3rd Alternative png
png matches the characters png literally (case sensitive)
Global pattern flags 
g modifier: global. All matches (don't return after first match)

Upvotes: 0

isapir

Reputation: 23592

Using Regular Expression is OK in some cases, or required in others, e.g. when you configure a system that allows you to match patterns and the input type is a RegEx pattern, but for this simple use case RegEx just wastes CPU cycles.

This use case is simple because you know the position at which you want to match the subbstrings - they are always at the end of the input, so each suffix either matches the given inputString or not:

inputString[ len(inputString) - len(suffix) : ] == suffix

But of course, you already have the Python method endswith(suffix), so you can test with:

inputString.endswith( suffix )

The suffix argument can be a tuple though, so you can do the following:

suffixes = ( "Inc.", "inc.", "Gmbh", "ltd.", "LTD", "LLP" )
inputString.endswith( suffixes )

Or for a case insensitive search:

suffixes = ( "inc.", "gmbh", "ltd.", "llp" )
inputString.lower().endswith( suffixes )

Anyway, if performance is really important then perhaps Python is not the best language.

Upvotes: 1

kabanus

Reputation: 25980

Well, the way you did it you have to test all the possible prefixes up to the suffix. One way to beat this, only if the string is long enough, is to reverse everything, so you get back to your first example:

prefix = re.compile('(?:{words}) (\\w+)'.format(words='|'.join([word[::-1] for word in words])))
re.match(prefix,mystring[::-1])

so you are searching from the end, and get back the same pattern - remember to reverse the matches though. ~~I wonder how long does the list of words and string need to be to make this worth it.~~ Apparently this is a major optimization booster see OP for some timing.

Upvotes: 2

Regex search multiple suffixes

Edit

Edit 2

Answers (3)

Related Questions