user3310334

Fastest way to lookup with pattern

Imagine I have a list of several-hundred unique names, e.g.

["john", "maria", "joseph", "richard", "samantha", "isaac", ...]

What's the best way I can store these to provide a fast lookup-time by matching against a pattern?

I only need to match "masks" (I can't think of a better word for it).

Basically, I get letters and their positions, e.g. ____a__ (where _ represents an unknown letter). Then I need to find all values in the data structure that match that mask; in this case it would return "richard", but it should also be possible to get multiple returned values.

Upvotes: 0

Views: 266

Answers (2)

BlackBear

Reputation: 22979

If the longest word has m letters, then you can keep m lists l[1], ..., l[m] such that the words in each list l[i] are sorted lexicographically starting from the i-th letter in every word (shorter words will not appear in that list). Then, if your query is ...ac., just perform a binary search in list l[4].

This will cost you O(mn) in memory and takes O(m n log n) time to build, but will give you O(log n) query time, which is the fastest you can get.
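A sketch of this scheme in Python, assuming the mask's known letters form one contiguous run (using the question's `_` for unknown letters; function names are my own illustration):

```python
from bisect import bisect_left, bisect_right

def build_lists(words):
    # For each position i, keep the words of length > i sorted by their
    # suffix starting at i; store the suffixes alongside for bisecting.
    m = max(len(w) for w in words)
    lists = []
    for i in range(m):
        ws = sorted((w for w in words if len(w) > i), key=lambda w: w[i:])
        lists.append(([w[i:] for w in ws], ws))
    return lists

def query(lists, run, pos, length):
    # Binary-search list l[pos] for words whose letters at positions
    # pos..pos+len(run)-1 (0-indexed) equal `run`, then filter by length.
    keys, ws = lists[pos]
    lo = bisect_left(keys, run)
    hi = bisect_right(keys, run + '\uffff')
    return [w for w in ws[lo:hi] if len(w) == length]

names = ["john", "maria", "joseph", "richard", "samantha", "isaac"]
lists = build_lists(names)
print(query(lists, "a", 4, 7))   # mask ____a__  ->  ['richard']
```

The length filter at the end discards words like "maria", which also has 'a' at index 4 but is shorter than the mask.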

EDIT
Good news: I have recently stumbled upon range trees, which would allow you to perform this kind of query somewhat efficiently, namely in O(log^m(n) + k) time, requiring O(n log^(m-1)(n)) space.

They are not straightforward to implement, in the sense that you need to build a binary search tree sorting the words by the first letter, then build a binary search tree for every internal node which stores the words in the subtree of that node sorted by the second letter, and so on.

On the other hand, this would allow you to perform a wider range of queries, namely you can look for contiguous intervals of letters, i.e. a pattern like ..[a-c].[b-f].
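Full range trees are involved to implement; as a hedged illustration only, here is the m = 2 case sketched as a merge-sort tree (a segment tree over the words sorted by first letter, where each node also stores its slice of words sorted by second letter), answering queries like [a-c][b-f] on two-letter words. The class and structure are my own, not from the answer:

```python
from bisect import bisect_left, bisect_right

class RangeTree2:
    # Sketch for m = 2 (words of length >= 2): segment tree over the words
    # sorted by 1st letter; each node keeps its slice sorted by 2nd letter.
    def __init__(self, words):
        self.words = sorted(words)                 # sorted by 1st letter
        self.root = self._build(0, len(self.words))

    def _build(self, lo, hi):
        by2 = sorted(self.words[lo:hi], key=lambda w: w[1])
        if hi - lo <= 1:
            return (lo, hi, by2, None, None)       # leaf
        mid = (lo + hi) // 2
        return (lo, hi, by2, self._build(lo, mid), self._build(mid, hi))

    def query(self, f_lo, f_hi, s_lo, s_hi):
        # Index range of words whose 1st letter lies in [f_lo, f_hi].
        i = bisect_left(self.words, f_lo)
        j = bisect_right(self.words, f_hi + '\uffff')
        out = []
        self._collect(self.root, i, j, s_lo, s_hi, out)
        return out

    def _collect(self, node, i, j, s_lo, s_hi, out):
        lo, hi, by2, left, right = node
        if hi <= i or j <= lo:
            return                                  # disjoint segment
        if i <= lo and hi <= j:
            # Canonical node: binary-search its 2nd-letter ordering.
            keys = [w[1] for w in by2]
            out.extend(by2[bisect_left(keys, s_lo):bisect_right(keys, s_hi)])
            return
        self._collect(left, i, j, s_lo, s_hi, out)
        self._collect(right, i, j, s_lo, s_hi, out)

t = RangeTree2(["an", "be", "cd", "af", "bb", "ce", "dx"])
print(sorted(t.query("a", "c", "b", "f")))   # ['af', 'bb', 'be', 'cd', 'ce']
```

Each query touches O(log n) canonical nodes and does one binary search in each, matching the O(log^2 n + k) bound for m = 2.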

Upvotes: 1

Jim Mischel

Reputation: 133995

Seems like a lot of work for "hundreds" of names. Doing a linear search on a list of hundreds of names will be very fast. Now, if you're talking hundreds of thousands or millions ...
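For scale, that baseline is only a few lines (a sketch in Python, using the question's `_` for "don't care"):

```python
def matches(word, mask):
    # '_' is a "don't care"; everything else must match exactly.
    return len(word) == len(mask) and all(
        m == '_' or m == c for m, c in zip(mask, word))

names = ["john", "maria", "joseph", "richard", "samantha", "isaac"]
print([w for w in names if matches(w, "____a__")])   # ['richard']
```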

In any case, you can speed this up using a dictionary. You can pre-process the data into a dictionary whose keys are a combination of character and position, and values are the words that contain that character at that position. For example, if you were to index "john" and "joseph", you would have:

{'j',0},{"john","joseph"}
{'o',1},{"john","joseph"}
{'h',2},{"john"}
{'n',3},{"john"}
{'s',2},{"joseph"}
{'e',3},{"joseph"}
{'p',4},{"joseph"}
{'h',5},{"joseph"}

Now let's say you're given the mask "jo...." (the dots are "don't care"). You'd turn that into two keys:

{'j',0}
{'o',1}

You query the dictionary for the list of words that has 'j' at index 0. Then you query the dictionary for the list of words that has 'o' at index 1. Then you intersect the lists to get your result.

It's a simple inverted index, but on character rather than on word.
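A sketch of the build-and-intersect steps in Python (using `_` for "don't care", as in the question; names are my own illustration):

```python
from collections import defaultdict
from functools import reduce

def build_index(words):
    # (char, position) -> set of words with that char at that position.
    index = defaultdict(set)
    for w in words:
        for pos, ch in enumerate(w):
            index[(ch, pos)].add(w)
    return index

def lookup(index, mask):
    keys = [(ch, pos) for pos, ch in enumerate(mask) if ch != '_']
    if not keys:                      # all-unknown mask: length filter only
        return {w for s in index.values() for w in s if len(w) == len(mask)}
    # Intersect the per-(char, position) sets, then keep exact-length words.
    hits = reduce(set.intersection, (index.get(k, set()) for k in keys))
    return {w for w in hits if len(w) == len(mask)}

index = build_index(["john", "maria", "joseph", "richard", "samantha", "isaac"])
print(lookup(index, "jo____"))   # {'joseph'}
```

Note the final length check: without it, "jo____" would also return "john", which matches the known letters but not the mask's length.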

The lists themselves will cost you a total of O(m * n) space, where m is the total number of words and n is the average word length in characters. At maximum, the number of dictionary entries will be 26*max_word_length. In practice, it will probably be much less.

If you make the values a HashSet<string> rather than List<string>, intersection will go much faster. It'll increase your memory footprint, though.

That should be faster than linear search if your masks contain only a few characters. The longer the mask, the more lists you'll have to intersect.

For the dictionary key, I'd recommend:

public struct Key
{
    public char KeyChar;
    public int Pos;

    public override int GetHashCode()
    {
        // Combine the character and the position into one hash.
        // The parentheses matter: '+' binds tighter than '<<' in C#.
        return (int)KeyChar + (Pos << 16);
    }

    public override bool Equals(object obj)
    {
        if (!(obj is Key)) return false;
        var other = (Key)obj;
        return KeyChar == other.KeyChar && Pos == other.Pos;
    }
}

So your dictionary would be Dictionary<Key, HashSet<string>>.

Upvotes: 3
