David542
David542

Reputation: 110452

Algorithm for a like '%term%' search

Suppose I have a db column filepath and I want to do a contains search, for example:

matches = []
for filepath in filepaths:
    if "mutan" in filepath.lower():
        matches.append(filepath)

Is there anyway at all to optimize this algorithm? I am open to storing ancillary structs or other methods, but what might be some practical ways to do this?

The only thing I can think of is to tokenize the filepath so that I have something like:

/my/new/File.jpg ==>
# 1 char
['e', 'g', 'f', 'i', 'j', 'm', 'l', '/', 'n', 'p', 'w', 'y', '.']
# 2 char
['/n', '/m', 'le', 'y/', 'w/', '/f', 'jp', 'ne', 'e.', 'il', 'fi', 'ew', 'my', '.j', 'pg']
# etc...

And then do a lookup with the term I have to see if it exists there, but to tokenize a word by every letter seems like it would take forever to do and take up tons of space as well. Is there another way to do this?

Upvotes: 0

Views: 74

Answers (2)

Dan D.
Dan D.

Reputation: 74675

Rather than "mutan" in filepath.lower(), you might consider:

import re
pat = re.compile("mutan", re.I)

if pat.search(filepath) is not None:

This combines the substring search with the case-ignoring. I have a program that uses this with a more complex pattern and it can filter the names of a directory tree as fast as they can be read.

Upvotes: 0

rlb
rlb

Reputation: 1714

A couple of common algorithms are the https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm and https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm there are others too.

Depending on your search needs you might also like to implement a small bloom filter, we use a 32 bit mask for each string we might want to search, setting bit 1 for 'a' present, bit 2 for 'b' etc. This allows us to quickly eliminate complete database rows and not search all.

To get more specific, we would need to know what you want to optimise for, memory, cpu etc. scanning even a few Gb is pretty quick on high end CPUs these days.

Upvotes: 1

Related Questions