user3310334

Fastest way to lookup with pattern

Imagine I have a list of several-hundred unique names, e.g.

["john", "maria", "joseph", "richard", "samantha", "isaac", ...]

What's the best way I can store these to provide a fast lookup-time by matching against a pattern?

I only need to match "masks" (I can't think of a better word for it).

Basically, I get letters and their positions, e.g. ____a__ (where _ represents an unknown letter). Then I need to find all values in the data structure that match that mask; in this case it would return "richard", but it should also be possible to get multiple returned values.

Upvotes: 0

Views: 266

Answers (2)

BlackBear

Reputation: 22979

If the longest word has m letters, then you can keep m lists l[1], ..., l[m] such that the words in each list l[i] are sorted lexicographically starting from the i-th letter in every word (shorter words will not appear in that list). Then, if your query is ...ac., just perform a binary search in list l[4].

This will cost you O(mn) in memory and takes O(m n log n) time to build, but will give you O(log n) query time, which is the fastest you can get.
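A sketch of this scheme in Python, assuming the mask's known letters form one contiguous run (using the question's `_` for unknown letters; function names are my own illustration):

```python
from bisect import bisect_left, bisect_right

def build_lists(words):
    # For each position i, keep the words of length > i sorted by their
    # suffix starting at i; store the suffixes alongside for bisecting.
    m = max(len(w) for w in words)
    lists = []
    for i in range(m):
        ws = sorted((w for w in words if len(w) > i), key=lambda w: w[i:])
        lists.append(([w[i:] for w in ws], ws))
    return lists

def query(lists, run, pos, length):
    # Binary-search list l[pos] for words whose letters at positions
    # pos..pos+len(run)-1 (0-indexed) equal `run`, then filter by length.
    keys, ws = lists[pos]
    lo = bisect_left(keys, run)
    hi = bisect_right(keys, run + '\uffff')
    return [w for w in ws[lo:hi] if len(w) == length]

names = ["john", "maria", "joseph", "richard", "samantha", "isaac"]
lists = build_lists(names)
print(query(lists, "a", 4, 7))   # mask ____a__  ->  ['richard']
```

The length filter at the end discards words like "maria", which also has 'a' at index 4 but is shorter than the mask.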

EDIT
Good news: I have recently stumbled upon range trees, which would allow you to perform this kind of query somewhat efficiently, namely in O(log^m(n) + k) time, requiring O(n log^(m-1)(n)) space.

They are not straightforward to implement, in the sense that you need to build a binary search tree sorting the words by the first letter, then build a binary search tree for every internal node which stores the words in the subtree of that node sorted by the second letter, and so on.

On the other hand, this would allow you to perform a wider range of queries, namely you can look for contiguous intervals of letters, i.e. a pattern like ..[a-c].[b-f].
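Full range trees are involved to implement; as a hedged illustration only, here is the m = 2 case sketched as a merge-sort tree (a segment tree over the words sorted by first letter, where each node also stores its slice of words sorted by second letter), answering queries like [a-c][b-f] on two-letter words. The class and structure are my own, not from the answer:

```python
from bisect import bisect_left, bisect_right

class RangeTree2:
    # Sketch for m = 2 (words of length >= 2): segment tree over the words
    # sorted by 1st letter; each node keeps its slice sorted by 2nd letter.
    def __init__(self, words):
        self.words = sorted(words)                 # sorted by 1st letter
        self.root = self._build(0, len(self.words))

    def _build(self, lo, hi):
        by2 = sorted(self.words[lo:hi], key=lambda w: w[1])
        if hi - lo <= 1:
            return (lo, hi, by2, None, None)       # leaf
        mid = (lo + hi) // 2
        return (lo, hi, by2, self._build(lo, mid), self._build(mid, hi))

    def query(self, f_lo, f_hi, s_lo, s_hi):
        # Index range of words whose 1st letter lies in [f_lo, f_hi].
        i = bisect_left(self.words, f_lo)
        j = bisect_right(self.words, f_hi + '\uffff')
        out = []
        self._collect(self.root, i, j, s_lo, s_hi, out)
        return out

    def _collect(self, node, i, j, s_lo, s_hi, out):
        lo, hi, by2, left, right = node
        if hi <= i or j <= lo:
            return                                  # disjoint segment
        if i <= lo and hi <= j:
            # Canonical node: binary-search its 2nd-letter ordering.
            keys = [w[1] for w in by2]
            out.extend(by2[bisect_left(keys, s_lo):bisect_right(keys, s_hi)])
            return
        self._collect(left, i, j, s_lo, s_hi, out)
        self._collect(right, i, j, s_lo, s_hi, out)

t = RangeTree2(["an", "be", "cd", "af", "bb", "ce", "dx"])
print(sorted(t.query("a", "c", "b", "f")))   # ['af', 'bb', 'be', 'cd', 'ce']
```

Each query touches O(log n) canonical nodes and does one binary search in each, matching the O(log^2 n + k) bound for m = 2.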

Upvotes: 1

Jim Mischel

Reputation: 133995

Seems like a lot of work for "hundreds" of names. Doing a linear search on a list of hundreds of names will be very fast. Now, if you're talking hundreds of thousands or millions ...
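For scale, that baseline is only a few lines (a sketch in Python, using the question's `_` for "don't care"):

```python
def matches(word, mask):
    # '_' is a "don't care"; everything else must match exactly.
    return len(word) == len(mask) and all(
        m == '_' or m == c for m, c in zip(mask, word))

names = ["john", "maria", "joseph", "richard", "samantha", "isaac"]
print([w for w in names if matches(w, "____a__")])   # ['richard']
```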

In any case, you can speed this up using a dictionary. You can pre-process the data into a dictionary whose keys are a combination of character and position, and values are the words that contain that character at that position. For example, if you were to index "john" and "joseph", you would have:

{'j',0},{"john","joseph"}
{'o',1},{"john","joseph"}
{'h',2},{"john"}
{'n',3},{"john"}
{'s',2},{"joseph"}
{'e',3},{"joseph"}
{'p',4},{"joseph"}
{'h',5},{"joseph"}

Now let's say you're given the mask "jo...." (the dots are "don't care"). You'd turn that into two keys:

{'j',0}
{'o',1}

You query the dictionary for the list of words that has 'j' at index 0. Then you query the dictionary for the list of words that has 'o' at index 1. Then you intersect the lists to get your result.

It's a simple inverted index, but on character rather than on word.
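A sketch of the build-and-intersect steps in Python (using `_` for "don't care", as in the question; names are my own illustration):

```python
from collections import defaultdict
from functools import reduce

def build_index(words):
    # (char, position) -> set of words with that char at that position.
    index = defaultdict(set)
    for w in words:
        for pos, ch in enumerate(w):
            index[(ch, pos)].add(w)
    return index

def lookup(index, mask):
    keys = [(ch, pos) for pos, ch in enumerate(mask) if ch != '_']
    if not keys:                      # all-unknown mask: length filter only
        return {w for s in index.values() for w in s if len(w) == len(mask)}
    # Intersect the per-(char, position) sets, then keep exact-length words.
    hits = reduce(set.intersection, (index.get(k, set()) for k in keys))
    return {w for w in hits if len(w) == len(mask)}

index = build_index(["john", "maria", "joseph", "richard", "samantha", "isaac"])
print(lookup(index, "jo____"))   # {'joseph'}
```

Note the final length check: without it, "jo____" would also return "john", which matches the known letters but not the mask's length.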

The lists themselves will cost you a total of O(m * n) space, where m is the total number of words and n is the average word length in characters. At maximum, the number of dictionary entries will be 26*max_word_length. In practice, it will probably be much less.

If you make the values a HashSet<string> rather than List<string>, intersection will go much faster. It'll increase your memory footprint, though.

That should be faster than linear search if your masks contain only a few characters. The longer the mask, the more lists you'll have to intersect.

For the dictionary key, I'd recommend:

public struct Key
{
    public char KeyChar;
    public int Pos;

    public override int GetHashCode()
    {
        // Combine the character and the position into one hash.
        // The parentheses matter: '+' binds tighter than '<<' in C#.
        return (int)KeyChar + (Pos << 16);
    }

    public override bool Equals(object obj)
    {
        if (!(obj is Key)) return false;
        var other = (Key)obj;
        return KeyChar == other.KeyChar && Pos == other.Pos;
    }
}

So your dictionary would be Dictionary<Key, HashSet<string>>.

Upvotes: 3
