Reputation: 115

Optimized regular expression search in a text corpus in Python

Here is my requirement in python:

I literally load a dictionary - say from /usr/share/dict/words (not to be confused with dictionary type) and use it to search for valid words. At the moment I am doing as follows:

dict_list  = open('dictionary', 'r').read().split()

def search_dictionary(key):
    p=re.compile(key)
    # Comment: Is 'key' a prefix for a valid word in dictionary?
    # If yes, return True, else return False
    tmp_list = [x for x in dict_list if bool(p.match(x))]
    if not tmp_list:
        ....
    else:
        ....

Note that search_dictionary can be called large number of times and this is a bottleneck at the moment. Is there a more efficient way to do this string search? Like say pre-compiling dictionary. One can think of a dictionary attack use case. I am relatively a starter.

EDIT: I have updated the code with comments. As suggested in comments, I am possibly doing more work than needed.

Upvotes: 1

Answers (2)

Antti Haapala -- Слава Україні

Reputation: 133919

Your algorithm runs in O(n) time with large constants. This seems so wrong, when a simple binary search could do O(lg n). If your regex does not contain special characters, why not:

import bisect
with open('dictionary') as f:
    dictionary = f.read().split()

    # .sort is slow, so better to sort
    # words on disk! And/or run many searches
    # for one invocation
    dictionary.sort()

def bisect_search(key):
    i = bisect.bisect_left(dictionary, key)
    if i != len(dictionary):
        return dictionary[i].startswith(key)

    return False

Sorts the array, and then finds the lexicographically "smallest" word for which word >= key and sees if it is a prefix for the given key.

Speed test against Padraic's first letter from dictionary, then linear search:

In [1]: %timeit bisect_search('thu')
1000000 loops, best of 3: 1.07 µs per loop

In [2]: %timeit search_dictionary('thu')
1000 loops, best of 3: 595 µs per loop

with 146880 words, and 5760 words starting with t.

Upvotes: 3

Padraic Cunningham

Reputation: 180401

If you want to stop on the first match use any which will short circuit as soon as we find a match, in your code you always go through every word in the dictionary even if you get a match on the first word, you also build a list unnecessarily:

dict_list  = open('dictionary').read().split()

def search_dictionary(key):
    p = re.compile(key)
    if any(p.match(x) for x in dict_list):
    .....

You should also preferably only be creating your dictionary once not every time you call the function. Define it at the start of your code and pass it as a parameter if needed.

If you want to find prefixes using str.startswith would probably be faster:

if any(x.startswith(key) for x in dict_list):

To optimise the startswith calls:

check = str.startswith
if any(check(x, key) for x in dict_list):

Or if it can appear anywhere just use in:

if any(key in x for x in dict_list):

using cpython2.7 using the optimised str.startswith method seems more efficient:

In [15]: s ="efficient"

In [16]: timeit p.match(s)
1000000 loops, best of 3: 359 ns per loop

In [17]: check = str.startswith

In [18]: timeit check(s,"eff")
1000000 loops, best of 3: 212 ns per loop

The difference for non matches is roughly the same

If you make a an actual dict from your dictionary where the keys are from a-z and the values are lists of words starting with the key, you can use the first letter from each key in your function to do lookups only searching the words that start with the same letter.

from collections import defaultdict
word_dict = defaultdict(list)

with open("/usr/share/dict/words") as f:
    for line in f:
        line = line.rstrip().lower()
        word_dict[line[0]].append(line)

You can see an example output use a key "z":

word_dict["z"]
['z', "z's", 'zachariah', "zachariah's", 'zachary', "zachary's", 'zachery', "zachery's", 'zagreb', "zagreb's", 'zaire', "zaire's", 'zairian', 'zambezi', "zambezi's", 'zambia', "zambia's", 'zambian', "zambian's", 'zambians', 'zamboni', 'zamenhof', "zamenhof's", 'zamora', 'zane', "zane's", 'zanuck', "zanuck's", 'zanzibar', "zanzibar's", 'zapata', 'zaporozhye', 'zapotec', 'zappa', "zappa's", 'zara', "zara's", 'zebedee', 'zechariah', 'zedekiah', "zedekiah's", 'zedong', "zedong's", 'zeffirelli', "zeffirelli's", 'zeke', "zeke's", 'zelig', 'zelma', "zelma's", 'zen', "zen's", 'zenger', "zenger's", 'zeno', "zeno's", 'zens', 'zephaniah', 'zephyrus', 'zeppelin', 'zest', "zest's", 'zeus', "zeus's", 'zhengzhou', 'zhivago', "zhivago's", 'zhukov', 'zibo', "zibo's", 'ziegfeld', 'ziegler', "ziegler's", 'ziggy', "ziggy's", 'zimbabwe', "zimbabwe's", 'zimbabwean', "zimbabwean's", 'zimbabweans', 'zimmerman', "zimmerman's", 'zinfandel', "zinfandel's", 'zion', "zion's", 'zionism', "zionism's", 'zionisms', 'zionist', "zionist's", 'zionists', 'zions', 'ziploc', 'zn', "zn's", 'zoe', "zoe's", 'zola', "zola's", 'zollverein', 'zoloft', 'zomba', "zomba's", 'zorn', 'zoroaster', "zoroaster's", 'zoroastrian', "zoroastrian's", 'zoroastrianism', "zoroastrianism's", 'zoroastrianisms', 'zorro', "zorro's", 'zosma', "zosma's", 'zr', "zr's", 'zsigmondy', 'zubenelgenubi', "zubenelgenubi's", 'zubeneschamali', "zubeneschamali's", 'zukor', "zukor's", 'zulu', "zulu's", 'zulus', 'zuni', 'zwingli', "zwingli's", 'zworykin', 'zyrtec', "zyrtec's", 'zyuganov', "zyuganov's", 'zürich', "zürich's", 'z', 'zanier', 'zanies', 'zaniest', 'zaniness', "zaniness's", 'zany', "zany's", 'zap', "zap's", 'zapped', 'zapping', 'zaps', 'zeal', "zeal's", 'zealot', "zealot's", 'zealots', 'zealous', 'zealously', 'zealousness', "zealousness's", 'zebra', "zebra's", 'zebras', 'zebu', "zebu's", 'zebus', 'zed', "zed's", 'zeds', 'zenith', "zenith's", 'zeniths', 'zephyr', "zephyr's", 'zephyrs', 'zeppelin', "zeppelin's", 'zeppelins', 'zero', "zero's", 'zeroed', 'zeroes', 'zeroing', 'zeros', 'zest', "zest's", 'zestful', 'zestfully', 'zests', 'zeta', 'zigzag', "zigzag's", 'zigzagged', 'zigzagging', 'zigzags', 'zilch', "zilch's", 'zillion', "zillion's", 'zillions', 'zinc', "zinc's", 'zinced', 'zincing', 'zincked', 'zincking', 'zincs', 'zing', "zing's", 'zinged', 'zinger', "zinger's", 'zingers', 'zinging', 'zings', 'zinnia', "zinnia's", 'zinnias', 'zip', "zip's", 'zipped', 'zipper', "zipper's", 'zippered', 'zippering', 'zippers', 'zippier', 'zippiest', 'zipping', 'zippy', 'zips', 'zircon', "zircon's", 'zirconium', "zirconium's", 'zircons', 'zit', "zit's", 'zither', "zither's", 'zithers', 'zits', 'zodiac', "zodiac's", 'zodiacal', 'zodiacs', 'zombi', "zombi's", 'zombie', "zombie's", 'zombies', 'zombis', 'zonal', 'zone', "zone's", 'zoned', 'zones', 'zoning', 'zonked', 'zoo', "zoo's", 'zoological', 'zoologist', "zoologist's", 'zoologists', 'zoology', "zoology's", 'zoom', "zoom's", 'zoomed', 'zooming', 'zooms', 'zoos', 'zucchini', "zucchini's", 'zucchinis', 'zwieback', "zwieback's", 'zygote', "zygote's", 'zygotes']

So fetch the vals use word_dict[key] to get the appropriate values:

def search_dictionary(key):
    check = str.startswith
    vals = word_dict[key[0]]
    if any(check(x, key) for x in vals):

Not sure if you have considered case or not so you may want to remove the lower call depending on what you want.

Upvotes: 2

Optimized regular expression search in a text corpus in Python

Answers (2)

Related Questions