patrick

Reputation: 4852

Regex-matching a dictionary efficiently in Python

I have a dictionary of word:count pairs in my Python script. From this dictionary, I would like to extract the entries matching any item in a list of strings.

I have found a working solution using regular expressions (see below), but it takes forever (~10 hours for a run). I feel there must be a faster way of doing this - do you have any input / ideas on how to improve my code?

import re

dicti = {'the': 20, 'a': 10, 'over': 2}
regex_list = ['the', 'an?']

extractddicti = {k: v for k, v in dicti.items()
                 if any(re.match("^" + regex + "$", k) for regex in regex_list)}

In reality, the dictionary has about 60,000 entries, the regex_list is ~ 1,000. The items in the regex list are regex strings, i.e. contain special characters like ?, parentheses such as (a|b|c), etc. They might match several entries in the dictionary.

Update / Edit

(see the accepted answer for an even better implementation of the same idea)

Following the advice of Keozon and others, I first compiled my regexes like so:

regex_list = ['the', 'an?']
regex_list_compiled = [re.compile("^" + i + "$") for i in regex_list]

and then slightly adapted my search function:

extractddicti = {k: v for k, v in dicti.items()
                 if any(regex.match(k) for regex in regex_list_compiled)}

The performance difference is quite breathtaking: a test run with a dictionary of 14,800 items and a list of 1,100 regexes took 34 minutes without compilation and slightly less than one (!) minute with compilation. I did not expect it to be so dramatic. Thanks for all your help!
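For anyone who wants to reproduce the trend, here is a rough, self-contained harness (toy data and function names are mine, purely illustrative):

import re
import timeit

dicti = {'the': 20, 'a': 10, 'over': 2}
regex_list = ['the', 'an?']
regex_list_compiled = [re.compile("^" + p + "$") for p in regex_list]

def without_compilation():
    return {k: v for k, v in dicti.items()
            if any(re.match("^" + p + "$", k) for p in regex_list)}

def with_compilation():
    return {k: v for k, v in dicti.items()
            if any(rx.match(k) for rx in regex_list_compiled)}

print(timeit.timeit(without_compilation, number=10000))
print(timeit.timeit(with_compilation, number=10000))

A plausible explanation for the size of the gap (an assumption on my part, not verified): re does cache compiled patterns internally, but the cache only holds a few hundred entries in CPython, so with ~1,000 distinct patterns the uncompiled version likely recompiles on every lookup.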

Upvotes: 2

Views: 19867

Answers (3)

caiohamamura

Reputation: 2718

What about joining the dict keys into a single string, to reduce the number of loops? It seems faster here:

import re, random, string

# Generate a random test dictionary (key collisions leave it somewhat
# smaller than 1,000,000 entries, which is fine for a benchmark)
dicti = {''.join(random.choice(string.ascii_uppercase + string.digits)
                 for _ in range(5)): i
         for i in range(1000000)}

# Using ';' as the separator, '[^;]' plays the role of '.' inside a key
regex_list = ['TH[^;]*', 'A[^;]*']
joined_keys = ';' + ';'.join(dicti.keys()) + ';'

# Lookarounds leave the separators unconsumed, so two adjacent matching
# keys (';FOO;BAR;') are both found, and findall returns the bare keys
extractddicti = {key: dicti[key]
                 for k in regex_list
                 for key in re.findall('(?<=;)' + k + '(?=;)', joined_keys)}

Timeit results for 10 loops:

╔════════════╦════════════════════╗
║ Algorithm  ║     Time (sec)     ║
╠════════════╬════════════════════╣
║ Mine       ║ 2.116118321594911  ║
║ E. Gordon  ║ 20.956314717412653 ║
╚════════════╩════════════════════╝

UPDATE

As @E. Gordon suggested, if you change the separator to a newline (\n) you can use the ^ and $ anchors together with re.MULTILINE:

regex_list = ['.*TH', 'AN.*']

joined_keys = '\n'.join(dicti.keys())
# One big anchored alternation: '^.*TH$|^AN.*$'
all_regex = "^" + "$|^".join(regex_list) + "$"
matched_keys = re.findall(all_regex, joined_keys, re.MULTILINE)

dicti_match = {k: dicti[k] for k in matched_keys}
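One caveat worth noting, assuming your real patterns are as described in the question: if a pattern contains a top-level alternation such as a|b, the joined string becomes ^a|b$, which anchors only the outer branches; and if it contains capture groups, re.findall returns the group contents instead of the whole key. Wrapping every pattern in a non-capturing group avoids both problems:

# '^(?:a|b)$' anchors every branch, and (?:...) adds no capture
# group for re.findall to latch onto
all_regex = "|".join("^(?:" + p + ")$" for p in regex_list)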

Upvotes: 3

topkara

Reputation: 896

Your problem looks very much like the ad-blocking problem. The EasyList list is similar to your regex list, but contains URLs and URL patterns.

In general, regular expression matching is relatively fast. Yet, as you have already seen, it is not as fast as not running it at all, and common regular expression libraries are not optimized for large disjunctions like yours.

Your regex list also contains plain full words. A trie or, if memory is an issue, a DAWG would be perfect for matching that fixed-string portion of the list.

The remaining problem is matching a large input against a large set of regexes, and a pre-filtering approach works very well in such cases: first check whether a match is even plausible, and only then run the full regex. For example, for the 'an?' regex to match, the string has to contain an 'a'. Suffix trees are perfect for finding such required substrings; you would probably want to add string-begin and string-end markers if your regexes are always full matches. Alternatively, you could build a trie or DAWG of the set of fixed strings and start a new search at every character; the same effect can be implemented more efficiently as a deterministic finite-state automaton using a good automaton library.
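To make the idea concrete, here is a minimal sketch of the pre-filter, with the required literals picked by hand (names and literals are mine, purely illustrative; deriving the literals automatically from arbitrary patterns is the harder part the suffix-tree machinery would handle):

import re

dicti = {'the': 20, 'a': 10, 'over': 2}

# Pair each pattern with a literal that every full match must contain
prefilters = [("the", "the"), ("an?", "a")]
compiled = [(re.compile("^" + pat + "$"), lit) for pat, lit in prefilters]

def matches_any(word):
    # Cheap substring test first; the full regex runs only on survivors
    return any(lit in word and rx.match(word) for rx, lit in compiled)

extracted = {k: v for k, v in dicti.items() if matches_any(k)}
print(extracted)  # {'the': 20, 'a': 10}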

Hopefully, this method will remove the need to run a regex match at all for a large portion of your input. Where you do still have to run the regular expressions, you can use the Python regex optimizations from the other answers.

Upvotes: 2

emmagordon

Reputation: 1222

Compiling the regular expressions once, rather than each time you use them, should gain you a fair bit of performance. So you'd have something like:

import re

dicti = {'the': 20, 'a': 10, 'over': 2}
patterns = ['the', 'an?']
# Binding .match up front also skips an attribute lookup per call
regex_matches = [re.compile("^" + pattern + "$").match for pattern in patterns]

extractddicti = {k: v for k, v in dicti.items()
                 if any(regex_match(k) for regex_match in regex_matches)}
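With the sample data above this gives the expected entries ('an?' matches 'a' because the n is optional):

print(extractddicti)  # {'the': 20, 'a': 10}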

Upvotes: 3
