Reputation: 13700
I have a big list of target words I am searching
words = ['Word1', 'Word2', 'Word3']
I've been told that a regular expression of this sort:
suffix = re.compile('(?:{words}) (\\w+)'.format(words='|'.join(words)))
Is pretty efficient, since it fails the regex evaluation immediately when a character that doesn't match the expression is met.
However, the other way around is not efficient:
prefix = re.compile('(\\w+) (?:{words})'.format(words='|'.join(words)))
Is there an elegant way to instruct python's regex to do the search in reverse ?
I've been asked to add example usages:
# prefix search
title = re.compile('(?:Mr.|Mrs.|Ms.|Dr. |Lt.) (\\w+)')
# suffix search
company = re.compile('(\\w+) (?:Inc.| LLP.|ltd.|GMBH)')
# invoking the regex
all_people_names = title.findall(document)
all_company_names = company.findall(document)
A lot of people had been skeptical regarding the significance of the timing differences.
I've implemented the 2 methods: endswith()
and endswith_rev()
that reverses the string and the results as kabanus suggested.
These are the results:
As you can see, it makes a huge difference, even with a small amount of suffixes.
Upvotes: 4
Views: 2220
Reputation: 666
try
.*\.(?:jpg|gif|png)
will match
1.jpg
b.png
c.gif
test it in https://regex101.com/
Non-capturing group (?:jpg|gif|png)
1st Alternative jpg
jpg matches the characters jpg literally (case sensitive)
2nd Alternative gif
gif matches the characters gif literally (case sensitive)
3rd Alternative png
png matches the characters png literally (case sensitive)
Global pattern flags
g modifier: global. All matches (don't return after first match)
Upvotes: 0
Reputation: 23592
Using Regular Expression is OK in some cases, or required in others, e.g. when you configure a system that allows you to match patterns and the input type is a RegEx pattern, but for this simple use case RegEx just wastes CPU cycles.
This use case is simple because you know the position at which you want to match the subbstrings - they are always at the end of the input, so each suffix
either matches the given inputString
or not:
inputString[ len(inputString) - len(suffix) : ] == suffix
But of course, you already have the Python method endswith(suffix)
, so you can test with:
inputString.endswith( suffix )
The suffix
argument can be a tuple
though, so you can do the following:
suffixes = ( "Inc.", "inc.", "Gmbh", "ltd.", "LTD", "LLP" )
inputString.endswith( suffixes )
Or for a case insensitive search:
suffixes = ( "inc.", "gmbh", "ltd.", "llp" )
inputString.lower().endswith( suffixes )
Anyway, if performance is really important then perhaps Python is not the best language.
Upvotes: 1
Reputation: 25980
Well, the way you did it you have to test all the possible prefixes up to the suffix. One way to beat this, only if the string is long enough, is to reverse everything, so you get back to your first example:
prefix = re.compile('(?:{words}) (\\w+)'.format(words='|'.join([word[::-1] for word in words])))
re.match(prefix,mystring[::-1])
so you are searching from the end, and get back the same pattern - remember to reverse the matches though. I wonder how long does the list of words and string need to be to make this worth it. Apparently this is a major optimization booster see OP for some timing.
Upvotes: 2