Reputation: 89
I want to search for words that match a given word in a list (example below). However, say the list contains millions of words. What is the most efficient way to perform this search? I was thinking of tokenizing each list and putting the words in a hash table, then performing the word search/match and retrieving the list of words that contain this word. From what I can see, this operation will take O(n) time. Is there any other way, maybe without using hash tables?
words_list = ['yek', 'lion', 'opt']
# e.g. if we search for the word "key" in this list, we should get back "yek", or a list of words if there are many that match
Also, is there a Python library or third-party package that can perform efficient searches?
Upvotes: 1
Views: 135
Reputation: 365925
It's not entirely clear what you mean by "match" here, but if you can reduce it to an equality comparison, the problem reduces to a set lookup, which takes O(1) time.
For example, if "match" means "has exactly the same set of characters":
words_set = {frozenset(word) for word in words_list}
Then, to look up a word:
frozenset(word) in words_set
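With the question's example list, frozenset('key') and frozenset('yek') are the same set of characters, so:
frozenset('key') in words_set  # True
frozenset('lot') in words_set  # False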
Or, if it means "has exactly the same multiset of characters" (i.e., counting duplicates but ignoring order):
words_set = {tuple(sorted(word)) for word in words_list}
tuple(sorted(word)) in words_set
(sorted returns a list, which isn't hashable, so it has to be frozen into a tuple before it can go into a set.)
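To see how this differs from the frozenset version: 'aab' and 'ab' have the same character set but different multisets, so only the tuple form tells them apart:
frozenset('aab') == frozenset('ab')          # True
tuple(sorted('aab')) == tuple(sorted('ab'))  # False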
… or, if you prefer:
import collections
words_set = {frozenset(collections.Counter(word).items()) for word in words_list}
frozenset(collections.Counter(word).items()) in words_set
(a Counter itself isn't hashable either, so its items have to be frozen the same way.)
Either way, the key (no pun intended… but maybe it should have been) idea here is to come up with a transformation that turns your values (strings) into values that compare equal iff they match (a set of characters, a multiset of characters, a sorted tuple of characters, etc.). Then, the whole point of a set is that it can find a value that's equal to your value in constant time.
Of course transforming the list takes O(N) time (unless you just build the transformed set in the first place, instead of building the list and then converting it), but you can use it over and over, and it takes O(1) time each time instead of O(N), which is what it sounds like you care about.
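For example, a minimal sketch of building the transformed set in that single pass, assuming the millions of words live one per line in a hypothetical words.txt:
with open('words.txt') as f:
    words_set = {frozenset(line.strip()) for line in f}
After that one O(N) pass, every lookup is a single O(1) set probe.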
If you need to get back the matching word rather than just know that there is one, you can still do this with a set, but it's easier (if you can afford to waste a bit of space) with a dict:
words_dict = {frozenset(word): word for word in words_list}
words_dict[frozenset(word)] # KeyError if no match
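Or, if you'd rather have a default value than an exception, plain dict.get works too:
words_dict.get(frozenset(word))  # None if no match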
If there could be multiple matches, just change the dict to a multidict:
words_dict = collections.defaultdict(set)
for word in words_list:
    words_dict[frozenset(word)].add(word)
words_dict[frozenset(word)] # empty set if no match
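With the question's example list:
words_dict[frozenset('key')]  # {'yek'}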
Or, if you explicitly want it to be a list rather than a set:
words_dict = collections.defaultdict(list)
for word in words_list:
    words_dict[frozenset(word)].append(word)
words_dict[frozenset(word)] # empty list if no match
If you want to do it without using hash tables (why?), you can use a search tree or other logarithmic data structure:
import blist  # pip install blist to get it
words_dict = blist.sorteddict()
for word in words_list:
    words_dict.setdefault(''.join(sorted(word)), set()).add(word)
words_dict[''.join(sorted(word))]  # KeyError if no match
(The keys here are sorted strings rather than frozensets, because a sorted dict needs keys with a total order, and sets only have a partial one.)
This looks almost identical, except that it's not quite trivial to wrap defaultdict around a blist.sorteddict, but that just takes a few lines of code. (And maybe you actually want a KeyError rather than an empty set, so I figured it was worth showing both defaultdict and a normal dict with setdefault somewhere, so you can choose.)
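For example, a minimal sketch of such a wrapper, assuming blist.sorteddict can be subclassed like an ordinary mapping (DefaultSortedDict is just an illustrative name, not part of blist):
import blist

class DefaultSortedDict(blist.sorteddict):
    # Mimic collections.defaultdict(set): store and return an
    # empty set whenever a key is missing.
    def __getitem__(self, key):
        try:
            return super().__getitem__(key)
        except KeyError:
            value = set()
            self[key] = value
            return value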
But under the covers, it's using a hybrid B-tree variant instead of a hash table. Although this is O(log N) time instead of O(1), in some cases it's actually faster than a dict.
Upvotes: 1