edoedoedo
edoedoedo

Reputation: 1641

Fastest way to check whether a string is a substring in a list of strings

I have a static list of 4000 distinct first names: so the length of the list is big (4000), but each string has about from 4 to 12 characters (they are names).

Then, I have a dynamic list of 10000 strings retrieved from a database: these strings may have an arbitrary length.

I need to output, for each one of the 10000 strings, whether that string contains one of the 4000 names, and if so, which one. If it contains more than one name, I need just one of them (i.e. the first). Moreover, it is unlikely to find such names, so maybe just 10 of the 10000 will contain a name.

My code so far:

names # list of 4000 short static names
fields # list of 10000 retrieved strings

def findit(element):
    for name in names:
        if name in element:
            return name
    return None

output = [findit(element) for element in fields]

This works of course. However, it is totally slow, since it is unlikely to find a name, and since I'm testing for being a substring and not for equality (i.e. I cannot use bisect or other sorted based index techniques). It scans all the names list completely, almost all the time. So basically, it executes about 10000 x 4000 = 40 million "in" comparisons.

Is there an algorithm to optimize this kind of search?

Upvotes: 6

Views: 6689

Answers (3)

trincot
trincot

Reputation: 350272

You could look into turning your list of names into one regular expression. Take for example this tiny list of names:

names = ['AARON',
    'ABDUL',
    'ABE',
    'ABEL',
    'ABRAHAM',
    'ABRAM',
    'ADALBERTO',
    'ADAM',
    'ADAN',
    'ADOLFO',
    'ADOLPH',
    'ADRIAN',
]

That could be represented with the following regular expression:

\b(?:AARON|ABDUL|ABE|ABEL|ABRAHAM|ABRAM|ADALBERTO|ADAM|ADAN|ADOLFO|ADOLPH|ADRIAN)\b

But that would not be very efficient. A regular expression that is built like a tree will work better:

\b(?:A(?:B(?:E(?:|L)|RA(?:M|HAM)|DUL)|D(?:A(?:M|N|LBERTO)|OL(?:FO|PH)|RIAN)|ARON))\b

You could then automate the production of this regular expression -- possibly by first creating a dict-tree structure from the list of names, and then translating that tree into a regular expression. For the above example, that intermediate tree would look like this:

{
    'A': {
        'A': {
            'R': {
                'O': {
                    'N': {
                        '': {}
                    }
                }
            }
        }, 
        'B': {
            'D': {
                'U': {
                    'L': {
                        '': {}
                    }
                }
            }, 
            'E': {
                '': {}, 
                'L': {
                    '': {}
                }
            }, 
... etc

... which could optionally be simplified to this:

{
    'A': {
        'ARON': {
            '': {}
        }
        'B': {
            'DUL': {
                '': {}
            },
            'E': {
                '': {}, 
                'L': {
                    '': {}
                }
            },
            'RA': {
                'HAM': {
                    '': {}
                },
                'M': {
                    '': {}
                } 
            } 
        }, 

... etc

Here is the suggested code to do this:

import re 

def addToTree(tree, name):
    if len(name) == 0:
        return
    if name[0] in tree.keys():
        addToTree(tree[name[0]], name[1:])
    else:
        for letter in name:
            tree[letter] = {}
            tree = tree[letter]
        tree[''] = {}

# Optional improvement of the tree: it combines several consecutive letters into 
# one key if there are no alternatives possible
def simplifyTree(tree):
    repeat = True
    while repeat:
        repeat = False
        for key, subtree in list(tree.items()):
            if key != '' and len(subtree) == 1 and '' not in subtree.keys():
                for letter, subsubtree in subtree.items():
                    tree[key + letter] = subsubtree
                del tree[key]
                repeat = True
    for key, subtree in tree.items():
        if key != '': 
            simplifyTree(subtree)

def treeToRegExp(tree):
    regexp = [re.escape(key) + treeToRegExp(subtree) for key, subtree in tree.items()]
    regexp = '|'.join(regexp)
    return '' if regexp == '' else '(?:' + regexp + ')'

def listToRegExp(names):
    tree = {}
    for name in names:
        addToTree(tree, name[:])
    simplifyTree(tree)
    return re.compile(r'\b' + treeToRegExp(tree) + r'\b', re.I)
    
# Demo
names = ['AARON',
'ABDUL',
'ABE',
'ABEL',
'ABRAHAM',
'ABRAM',
'ADALBERTO',
'ADAM',
'ADAN',
'ADOLFO',
'ADOLPH',
'ADRIAN',
]

fields = [
    'This is Aaron speaking',
    'Is Abex a name?',
    'Where did Abraham get the mustard from?'
]

regexp = listToRegExp(names)
# get the search result for each field, and link it with the index of the field
results = [[i, regexp.search(field)] for i, field in enumerate(fields)]
# remove non-matches from the results
results = [[i, match.group(0)] for [i, match] in results if match]
# print results
print(results)

Upvotes: 4

edoedoedo
edoedoedo

Reputation: 1641

I have found out about the Aho-Corasick algorithm for Multiple string searching (see https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm), and its python implementation pyahocorasick (see http://pyahocorasick.readthedocs.io/en/latest/).

I rewrote my code using this library:

import ahocorasick

names # list of 4000 short static names
fields # list of 10000 retrieved strings

automaton = ahocorasick.Automaton()

for name in names:
    automaton.add_word(name, name)

automaton.make_automaton()

def findit_with_ahocorasick(element):
    try:
        return next(A.iter(element))[1]
    except StopIteration:
        return None


output = [findit_with_ahocorasick(element) for element in fields]

This performs so much faster compared to what I did before (i.e. I estimated a raw statistic with my data, it's about 12 seconds vs 0.8 seconds for the entire 10000 batch).

Moreover, as the docs state, the initial creation of the Automaton object, which needs to be fed with the list of names in order to create the words tree, can be pickled if the words are static as they are in my case.

Upvotes: 3

zipa
zipa

Reputation: 27869

I think this could be faster, if you are to use set() of input and just check for intersection between sets:

names = ['AARON',
    'ABDUL',
    'ABE',
    'ABEL',
    'ABRAHAM',
    'ABRAM',
    'ADALBERTO',
    'ADAM',
    'ADAN',
    'ADOLFO',
    'ADOLPH',
    'ADRIAN',
]

search = {'BE', 'LFO', 'AB'}

def get_all_substrings(input_string):
  length = len(input_string)
  return {input_string[i:j+1] for i in range(length) for j in xrange(i,length)}

names_subs = {name: get_all_substrings(name)  for name in names}
result = [name for name, sub in names_subs.items() if bool(search.intersection(sub))]

Upvotes: 1

Related Questions