user2032118
user2032118

Reputation: 455

Google search suggestion implementation

In a recent amazon interview I was asked to implement Google "suggestion" feature. When a user enters "Aeniffer Aninston", Google suggests "Did you mean Jeniffer Aninston". I tried to solve it by using hashing but could not cover the corner cases. Please let me know your thought on this.

Upvotes: 2

Views: 2015

Answers (1)

user2104560
user2104560

Reputation:

There are 4 most common types of erros -

  1. Omitted letter: "stck" instead of "stack"
  2. One letter typo: "styck" instead of "stack"
  3. Extra letter: "starck" instead of "stack"
  4. Adjacent letters swapped: "satck" instead of "stack"

BTW, we can swap not adjacent letters but any letters but this is not common typo.

Initial state - typed word. Run BFS/DFS from initial vertex. Depth of search is your own choice. Remember that increasing depth of search leads to dramatically increasing number of "probable corrections". I think depth ~ 4-5 is a good start.

After generating "probable corrections" search each generated word-candidate in a dictionary - binary search in sorted dictionary or search in a trie which populated with your dictionary.

Trie is faster but binary search allows searching in Random Access File without loading dictionary to RAM. You have to load only precomputed integer array[]. Array[i] gives you number of bytes to skip for accesing i-th word. Words in Random Acces File should be written in a sorted order. If you have enough RAM to store dictionary use trie.

Before suggesting corrections check typed word - if it is in a dictionary, provide nothing.

UPDATE

Generate corrections should be done by BFS - when I tried DFS, entries like "Jeniffer" showed "edit distance = 3". DFS doesn't works, since it make a lot of changes which can be done in one step - for example, Jniffer->nJiffer->enJiffer->eJniffer->Jeniffer instead of Jniffer->Jeniffer.

Sample code for generating corrections by BFS

static class Pair
    {
        private String word;
        private byte dist; 
        // dist is byte because dist<=128.
        // Moreover, dist<=6 in real application

        public Pair(String word,byte dist)
        {
            this.word = word;
            this.dist = dist;
        }

        public String getWord()
        {
            return word;
        }

        public int getDist()
        {
            return dist;
        }
    }


    public static void main(String[] args) throws Exception
    {

        HashSet<String> usedWords;
        HashSet<String> dict;
        ArrayList<String> corrections;
        ArrayDeque<Pair> states;

        usedWords = new HashSet<String>();
        corrections = new ArrayList<String>();
        dict = new HashSet<String>();
        states = new ArrayDeque<Pair>();

        // populate dictionary. In real usage should be populated from prepared file.
        dict.add("Jeniffer");
        dict.add("Jeniffert"); //depth 2 test

        usedWords.add("Jniffer");
        states.add(new Pair("Jniffer", (byte)0));
        while(!states.isEmpty())
        {
            Pair head = states.pollFirst();
            //System.out.println(head.getWord()+" "+head.getDist());
            if(head.getDist()<=2) 
            {
                // checking reached  depth.
                //4 is the first depth where we don't generate anything  

                // swap adjacent letters
                for(int i=0;i<head.getWord().length()-1;i++)
                {
                    // swap i-th and i+1-th letters
                    String newWord = head.getWord().substring(0,i)+head.getWord().charAt(i+1)+head.getWord().charAt(i)+head.getWord().substring(i+2);
                    // even if i==curWord.length()-2 and then i+2==curWord.length 
                    //substring(i+2)  doesn't throw exception and returns empty string
                    // the same for substring(0,i) when i==0

                    if(!usedWords.contains(newWord))
                    {
                        usedWords.add(newWord);
                        if(dict.contains(newWord))
                        {
                            corrections.add(newWord);
                        }
                        states.addLast(new Pair(newWord, (byte)(head.getDist()+1)));
                    }
                }

                 // insert letters 
                for(int i=0;i<=head.getWord().length();i++)
                      for(char ch='a';ch<='z';ch++)
                      {
                            String newWord = head.getWord().substring(0,i)+ch+head.getWord().substring(i);
                            if(!usedWords.contains(newWord))
                            {
                                usedWords.add(newWord);
                                if(dict.contains(newWord))
                                {
                                    corrections.add(newWord);
                                }
                                states.addLast(new Pair(newWord, (byte)(head.getDist()+1)));
                            }
                      }
            }
        }

        for(String correction:corrections)
        {
          System.out.println("Did you mean "+correction+"?");
        }

        usedWords.clear();
        corrections.clear();
        // helper data structures must be cleared after each generateCorrections call - must be empty for the future usage. 

    }

Words in a dictionary - Jeniffer,Jeniffert. Jeniffert is just for testing)

Output:

Did you mean Jeniffer?
Did you mean Jeniffert?

Important!

I choose depth of generating = 2. In real application depth should be 4-6, but as number of combinations grows exponentially, I don't go so deep. There are some optomizations devoted to reduce number of branches in a searching tree but I don't think much about them. I wrote only main idea.

Also, I used HashSet for storing dictionary and for labeling used words. It seems HashSet's constant is too large when it containt million objects. May be you should use trie both for word in a dictionary checking and for is word labeled checking.

I didn't implement erase letters and change letters operations because I want to show only main idea.

Upvotes: 2

Related Questions