Approaches for finding matches in a large dataset

Question

I have a project where, given a list of ~10,000 unique strings, I want to find where those strings occur in a file with 10,000,000+ string entries. I also want to include partial matches if possible. My list of ~10,000 strings is dynamic data and updates every 30 minutes, and currently I'm not able to process all of the searching to keep up with the updated data. My searches take about 3 hours now (compared to the 30 minutes I have to do the search within), so I feel my approach to this problem isn't quite right.

My current approach is to first create a list from the 10,000,000+ string entries. Then each item from the dynamic list is searched for in the larger list using an in-search.

results_boolean = [keyword in n for n in string_data]

Is there a way I can greatly speed this up with a more appropriate approach?

Tim Peters · Accepted Answer

In general, you would want to preprocess the large, unchanging data in some way to speed repeated searches. But you said too little to suggest something clearly practical. Like: how long are these strings? What's the alphabet (e.g., 7-bit ASCII or full-blown Unicode)? How many characters total are there? Are characters in the alphabet equally likely to appear in each string position, or is the distribution highly skewed? If so, how? And so on.

Here's about the simplest kind of indexing: building a dict with a number of entries equal to the number of unique characters across all of string_data. It maps each character to the set of string_data indices of strings containing that character. Then a search for a keyword can be restricted to only those string_data entries known in advance to contain the keyword's first character.

Now, depending on details that can't be guessed from what you said, it's possible even this modest indexing will consume more RAM than you have - or it's possible that it's already more than good enough to get you the 6x speedup you seem to need:

# Preprocessing - do this just once, when string_data changes.
def build_map(string_data):
    from collections import defaultdict
    ch2ixs = defaultdict(set)
    for i, s in enumerate(string_data):
        for ch in s:
            ch2ixs[ch].add(i)
    return ch2ixs

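def find_partial_matches(keywords, string_data, ch2ixs):
    for keyword in keywords:
        # Only strings containing the keyword's first character can match.
        # (Assumes every keyword is non-empty.)
        ch = keyword[0]
        if ch in ch2ixs:
            result = []
            for i in ch2ixs[ch]:
                if keyword in string_data[i]:
                    result.append(i)
            if result:
                # Set iteration order isn't guaranteed, so sort for display.
                print(repr(keyword), "found in strings", sorted(result))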

Then, e.g.,

string_data = ['banana', 'bandana', 'bandito']
ch2ixs = build_map(string_data)

find_partial_matches(['ban', 'i', 'dana', 'xyz', 'na'],
                     string_data,
                     ch2ixs)

displays:

'ban' found in strings [0, 1, 2]
'i' found in strings [2]
'dana' found in strings [1]
'na' found in strings [0, 1]

If, e.g., you still have plenty of RAM, but need more speed, and are willing to give up on (probably silly - but can't guess from here) 1-character matches, you could index bigrams (adjacent letter pairs) instead.
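
A minimal sketch of that bigram idea, under the same assumptions as the code above. The names build_bigram_map, bg2ixs and find_partial_matches_bigram are made up for illustration, and it returns a dict instead of printing so the result is easy to check. Keywords shorter than 2 characters are simply skipped:

```python
from collections import defaultdict

def build_bigram_map(string_data):
    """Map each adjacent character pair to the indices of strings containing it."""
    bg2ixs = defaultdict(set)
    for i, s in enumerate(string_data):
        for j in range(len(s) - 1):
            bg2ixs[s[j:j + 2]].add(i)
    return bg2ixs

def find_partial_matches_bigram(keywords, string_data, bg2ixs):
    """Return {keyword: sorted matching indices} for keywords of length >= 2."""
    found = {}
    for keyword in keywords:
        if len(keyword) < 2:
            continue  # this index can't help with 1-character keywords
        # .get() avoids creating empty entries in the defaultdict.
        candidates = bg2ixs.get(keyword[:2])
        if not candidates:
            continue
        hits = sorted(i for i in candidates if keyword in string_data[i])
        if hits:
            found[keyword] = hits
    return found

string_data = ['banana', 'bandana', 'bandito']
bg2ixs = build_bigram_map(string_data)
bigram_result = find_partial_matches_bigram(
    ['ban', 'i', 'dana', 'xyz', 'na'], string_data, bg2ixs)
print(bigram_result)
```

Whether this beats the single-character index depends on how skewed the bigram distribution is in your data, which can't be guessed from here.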

In the limit, you could build a trie out of string_data, which would require lots of RAM, but could reduce the time to search for an embedded keyword to a number of operations proportional to the number of characters in the keyword, independent of how many strings are in string_data.
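
To make that concrete, here is a toy sketch. One assumption on my part: to find keywords embedded anywhere in a string, the trie has to hold every suffix of every string, so this builds a plain (uncompressed) suffix trie out of nested dicts. The names build_suffix_trie and search are illustrative only. Node count grows roughly with the sum of squared string lengths, so treat this as a picture of the idea, not something to run on 10 million strings as-is:

```python
def build_suffix_trie(string_data):
    """Insert every suffix of every string.

    Each node records the indices of the strings that pass through it.
    """
    root = {"ixs": set(), "next": {}}
    for i, s in enumerate(string_data):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["next"].setdefault(ch, {"ixs": set(), "next": {}})
                node["ixs"].add(i)
    return root

def search(root, keyword):
    """Walk one node per keyword character; return sorted matching indices."""
    node = root
    for ch in keyword:
        node = node["next"].get(ch)
        if node is None:
            return []  # no string contains this keyword
    return sorted(node["ixs"])

string_data = ['banana', 'bandana', 'bandito']
trie = build_suffix_trie(string_data)
for keyword in ['ban', 'i', 'dana', 'xyz', 'na']:
    print(repr(keyword), "found in strings", search(trie, keyword))
```

The walk itself takes one step per keyword character. Reporting the hits still costs time proportional to how many strings match.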

Note that you should really find a way to get rid of this:

results_boolean = [keyword in n for n in string_data]

Building a list with over 10 million entries for every keyword search makes every search expensive, no matter how cleverly you index the data.
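
For example, if all you need is where a keyword occurs, collecting just the matching indices avoids the giant list of booleans. This is only a sketch (matching_indices is a made-up name), and on its own it still scans every string, so it's meant to be combined with some kind of index:

```python
def matching_indices(keyword, string_data):
    """Return only the indices of strings that contain keyword."""
    return [i for i, s in enumerate(string_data) if keyword in s]

string_data = ['banana', 'bandana', 'bandito']
na_hits = matching_indices('na', string_data)
print(na_hits)
```

The result is only as long as the number of hits, instead of one entry per string in string_data.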

Note: a probably practical refinement of the above is to restrict the search to strings that contain all of the keyword's characters:

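def find_partial_matches(keywords, string_data, ch2ixs):
    for keyword in keywords:
        keyset = set(keyword)
        # A match is impossible if any keyword character appears nowhere.
        if all(ch in ch2ixs for ch in keyset):
            # Candidates must contain every distinct keyword character.
            # (Assumes every keyword is non-empty.)
            ixs = set.intersection(*(ch2ixs[ch] for ch in keyset))
            result = []
            for i in ixs:
                if keyword in string_data[i]:
                    result.append(i)
            if result:
                # Set iteration order isn't guaranteed, so sort for display.
                print(repr(keyword), "found in strings", sorted(result))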
