Misha Moroshko
Misha Moroshko

Reputation: 171459

What's the most appropriate data structure for autosuggest?

I'd like to implement an autosuggest component. For every user input, the component should provide zero or more suggestions.

For example, if user types 'park', suggestions could be: ['Parkville', 'Parkwood', 'Bell Park'].

The requirements are simple:

What data structure would you choose to implement this in Javascript?

Are there any Javascript/Node.js libraries that can help?

Upvotes: 3

Views: 439

Answers (4)

user2104560
user2104560

Reputation:

I think the best data structure for such task is a trie. About case insetivity - just lowercase each word before adding to the trie and perform searching on lower cased words.

When you reach some point of the trie, there are number of children nodes which are satisfying strings - string with have the prefix from root to current point.

Output suggestions - recursively walk from current point(which is reached from root to user typed prefix) and print suggestion on nodes which marked as leafs. Stop printing after ~10 outputs because trie may have many satisfying words.

Here is js implementations: trie-js, trie and many others. Just search js+trie. May be trie+autosuggest+js will also work)

UPDATE 1

If you want to output all variants in O(1) (O(1) for each suggestion, of course), without recursive walk, you may store arraylist of references in each node. Arraylist stores indices of all words which belongs to node and each value is an index in other dictionary araylist.

Something like that:

Add word to dict:

  1. Check in O(word_len) is it in a trie (already added). If not, add to dictionary and remember index in "storage"

     if(!inTrie(word)){
        dict.push(word);
        index = dict.length-1; //zero based indices
        // now adding to trie
        for each node visited while addition to trie : node.references.push(index)
     }
    
  2. Search:

    Go to node with prefix==typed word;
    for(int i=0;i<min(curNode.references.length(),10);i++)
    print(dict[curNode.references[i]];
    

UPDATE 2

About 'pa' --> 'Very cool park'

You definetily should split phrase to separate words in order each word be "searchable" in a trie. BUT! As you treat phrase as one word, you should store it one cell.

I mean:

String phrase = "Very cool parl";
dict.push(phrase);
index = dict.length-1;

parts[] = split(phrase);
for part in parts{
 add part - for each node visited while addition of part perform node.references.push(index);
}

In other words, code of phrase the same with code of single word. And reference is the same, because we store all part together in one cell as a phrase. Difference is in splitting and adding pharse by parts. Simple, you see.

UPDATE 3

BTW, references storing is not so "expensive" in memory consumption - each letter in word is some node in a trie which means 1 entry in some arraylist(one integer index of this word in a global storage array). So, you have to only extra O(dictionary_length) memory, which is ~ 50000*20 = 1 000 000 integers ~ 4 MB, assuming that each word has at most 20 letters. So, upper bound of needed memory is 4 MB.

UPDATE 4

About 'e e' --> East Eagle.

OK, before posting any idea I would like to warn that this is very strange autosuggest behaviour and usually autosuggest matches one prefix but not all prefixes.

There is very simple idea which will increase searhcing complexity for such several prefixes parallel search for some delta where delta complexity = complexity of finding sets intersection.

  1. Now global storage contains not just indices but pairs <a,b> where a = index in storage, b = index in pharse. For simple word b=0 or -1 or any special value.
  2. Now each trie node reference arraylist contains pairs. When user types prefixes phrase, for example "ea ri", you shoud find 'ea' node as usual, iterate over refernces but take to account only those entries, where a=any, b=1, because ea index in typed phrase = 1. Put all such a indices where b=1 to some set. Find ri node as usual, iterate over references and put those a indices to other set where b=2 and so on. Find intersection of indices sets. Ouput storage words by index, where index belongs to intersection of sets.

When you searching not phrase but simple word you iterate over all reference items.

Upvotes: 4

ACE Arthur
ACE Arthur

Reputation: 211

In fact, trie is proper data structure for achieve your goal. The implementation is short and easy. My solution are list below. Usages are appended after Trie implementation.

function TrieNode(ch) {
    this.key = ch;
    this.isTail = false;
    this.children = [];
}

TrieNode.prototype = {
    constructor : TrieNode,

    /**
     * insert keyword into trie
     */
    insert : function(word) {
        if (word && word.length == 0)
            return;
        var key = word[0];
        if (this.children[key] == null) {
            this.children[key] = new TrieNode(key);
        }
        if (word.length == 1) {
            this.children[key].isTail = true;
        } else if (word.length > 1) {
            this.children[key].insert(word.slice(1));
        }
    },

    /**
    * return whether a word are stored in trie
    */
    search : function(word) {
        if (word && word.length == 0 || this.children[word[0]] == null)
            return false;
        if (word.length == 1) {
            return this.children[word[0]].isTail;
        } else {
            return this.children[word[0]].search(word.slice(1));
        }
    },


    /**
     * NOTICE: this function works only if length of prefix longer than minimum trigger length
     * 
     * @param prefix
     *            keywords prefix
     * @returns {Array} all keywords start with prefix
     */
    retrieve : function(prefix) {
        var MINIMUM_TRIGGER_LENGTH = 1;
        if (prefix == null || prefix.length < MINIMUM_TRIGGER_LENGTH)
            return [];
        var curNode = this.walk(prefix);
        var collection = [];
        curNode && curNode.freetrieve(prefix, collection);
        return collection;
    },

    walk : function(prefix) {
        if (prefix.length == 1) {
            return this.children[prefix];
        }
        if (this.children[prefix[0]] == null) {
            return null;
        }
        return this.children[prefix[0]].walk(prefix.slice(1));
    },

    freetrieve : function(got, collection) {
        for ( var k in this.children) {
            var child = this.children[k];
            if (child.isTail) {
                collection.push(got + child.key);
            }
            child.freetrieve(got + child.key, collection);
        }
    }
}

// USAGE lists below
function initTrieEasily(keywords){
    let trie= new TrieNode();
    keywords.forEach(word => trie.insert(word));
    return trie;
}

var words=['mavic','matrix','matrice','mavis','hello world'];

var trie=initTrieEasily(words);
trie.retrieve('ma');  // ["mavic", "mavis", "matrix", "matrice"]
trie.retrieve("mat")  // ["matrix", "matrice"]
trie.search("hello"); // "false"
trie.search("hello world");  //"true"
trie.insert("hello");
trie.search("hello"); // "true"
trie.retrieve("hel"); // ["hello", "hello world"]

Upvotes: 0

Jim Mischel
Jim Mischel

Reputation: 134125

Sometimes the simple way is the best. You said that you have ~50,000 entries in your dictionary. You don't say how many of them have multiple words (i.e. "Bell Park" or "Martin Luther King Drive", etc.). Just for argument, let's assume that the average number of words per dictionary entry is 2.

I'm not great with Javascript, so I'm going to describe this in general terms, but you should be able to implement it fairly easily.

In your preprocessing step, create an array of all the items in your database (i.e. all 50,000). So, for example:

Carpark
Cool Park
Park
Pike's Peak
...

Then, create a map that contains an entry for each word, and a list of indexes to the item in the first array that contains it. So your second array would contain:

Carpark, {0}
Cool, {1}
Park, {1,2}
Pike's, {3}
Peak, {3}

Sort that second array by the word. So the order would be {Carpark,Cool,Park,Peak,Pike's}.

When the user types "P", do a binary search on the array of words to find the first word that starts with "P". You can do a sequential scan of the data from that point until you get to a word that doesn't start with P. As you visit each word, add the words referenced in the list of indexes to your output. (You'll have to do some de-duplication here, but that's easy enough.)

Binary search is O(log n), so finding the first word will be pretty fast. Especially with such a small amount of data. If you're doing an HTTP request for each letter typed, the communication time will dwarf the processing time. There's not a particularly good reason to try to speed this up on the server side.

You can however, reduce the load on the server. Realize that when the user types "P" and the client gets data back from the server, the client now has all of the possible strings that could possibly be returned if the user were to type "PA". That is, the results from "PA" are a subset of the results from "P".

So if you modified your code to make the client make a request to the server only on the first character typed, you could do subsequent searches on the client. All you'd have to do is make the server return the list of matched words (from the second data structure) as well as the matched phrases indexed by those words. So when the user types the second, third, forth, etc. characters, the client goes through the process of filtering. The server doesn't have to get involved.

The benefits of that, of course, are faster response and less load on the server. The cost is a small amount of added complexity on the client, and a small amount of additional data returned on the first request. But that extra data returned on the first request will likely be less than what would have been returned by the server on the second request.

Upvotes: 1

jorgenbs
jorgenbs

Reputation: 557

Give http://lunrjs.com a try. It lets you set boost certain properties if you like. Simple and small.

If you need something even simpler you could see if there are any implementations of the Levenshtein Distance. A cooler algorithm is Soundex, which rates based on the phonetic properties of words.

Upvotes: 1

Related Questions