compie
compie

Reputation: 10536

Search for multiple words by prefix (trie data structure)

How can I use a trie (or another data structure or algorithm) to efficiently search for multiple words by prefix?

For example: suppose this is my data set:

A trie data structure allows me to efficiently retrieve all names starting with "Bo" (thus without iterating over all the names). But I also want to search on the last name by prefix, thus searching for "Wa" should find "Bobby Walker". And to complicate things: when the user searches for "Bo Wa" this should also find the same name. How can I implement this? Should I use a separate trie structure for each part of the name? (And how to combine the results)?

Background: I'm writing the search functionality for a big address book (10000+ names). I want to have a really fast autocomplete function that shows results while people are typing the first few letters of the first & last name. I already have a solution that uses a regex, but it needs to iterate over all names which is to slow.

Upvotes: 5

Views: 2772

Answers (3)

Oren Yosifon
Oren Yosifon

Reputation: 927

A very good data structure would be a Burst Trie

There's a Scala implementation.

Upvotes: 3

Cybercartel
Cybercartel

Reputation: 12592

You could try a second trie with the reversed string and a wildcard search:http://phpir.com/tries-and-wildcards/

Upvotes: 2

Iulian Popescu
Iulian Popescu

Reputation: 2643

I think that a sorted array will also fit for your requirements, an array containing Person objects (they have a firstName and a lastName field). Let's say that you have a prefix and want to find all the values that fit your prefix. Simply run a binary search to find the first position (let's say is firstIndex) where your prefix appears on firstName and one more to find the last position (lastIndex). Now you can retrieve your values in O(lastIndex - firstIndex). The same goes when you want to find them by lastName. When you have a prefixFirstName and a prefixLastName you can search for the interval where values match for prefixFirstName and then, on this interval, you can check for the values matching the prefixLastName. For a conclusion, when you have one or two prefixes you run 4 binary searches (around 17 iterations per search for 100k names) which is fast enough and you can retrieve them in linear time. Even if it isn't the fastest solution, I suggested it since it's easy to understand and easy to code.

Upvotes: 1

Related Questions