OscarRyz
OscarRyz

Reputation: 199215

autocomplete algorithms, papers, strategies, etc

I'm wondering if anyone has good resources to read or code to experiment for "autcomplete"

I would like to know what's the theory behind autocompletion, where to start what are the commonn mistakes etc.

I found fascinating the way products like Enso, Launchy, Google chrome and even tcsh perform their auto complete, I started my self just for curiosity some sample code and I got to the conclusion this must be a field widely explored before.

I would appreciate if someone shares any good technical resource on how to implement this.

Thanks in advance.

Upvotes: 15

Views: 7957

Answers (4)

Ario
Ario

Reputation: 659

It is an open problem and there are dozen of strategies based on situations. based on my knowledge, I list a short highlights of some of well known Auto Completion strategies and their corresponding data structures. also I tried to summarizing their major pros and cons related to Auto Completion problem.

Brute Force :

  • pros: can achieved by checking all universe(input) as the next step
  • pros: super easy
  • pros: it's good for tiny dataset with limited states
  • cons: connections doesn't stored so each time you have to perform a search
  • cons: has the worst time complexity.

Prefixed Tree( Trie ) :

Trie from Wikipedia

  • pros: it is the simplest data structure designed for these kind of problems.
  • pros: list of all available next states are stored in each state.
  • cons: data size should be small( at most should be a low fractional of RAM size).

Directed Acyclic Graph (DAG) :

Trie has space problem so other data structures main goal is reducing the space complexity. Directed Acyclic Graph (DAG) is one of the options. by using DAG you can merge all similar subpath to only one. as a result much of space will be reserved.

Directed Acyclic Graph

fast-autocomplete repository is in this area which uses Directed Word Graph (DWG) and Levenshtein Edit Distance.


Some other Tree options :

On each state(or Node) there is a search problem. linear search is the worst case option so most strategies improved their search time by using either sorting( O(nlog(n) ) and then using Binary search( O(log(n)) ) or using a hashtable ( O(1), fast but has more space complexity). encountering so many trade-off dilemmas, other tree data structure variants like Radix Tree, Suffix Tree, Suggest Tree and Merkle Tree may come useful.

Prioritizig Offers : Markov Chain can be used to prioritize next states. it stores connection probabilities which leads to more accurate results.

Markov Chain

Artificial Intelligence strategies:

  • pros: good models are small and running fast.
  • pros: good models just stores discriminative features.
  • pros: huge dataset friedly
  • pros: Natural Language Processing (NLP ) frameworks has predefined algorithms (higher level API)
  • pros: long term learning power is greatly increased.
  • cons: understanding input topology is important.
  • cons: preprocessing is hard process.
  • cons: training time is long.
  • cons: convergence may needs too many retrying and retraining which is a hard process.

Long short term memory (LSTM ) :

there are so many useful Machine Learning and Deep Learning strategies. a good strategy, you can consider Auto Completion as a time series problem so you can use some models like LSTM.

LSTM


Transformers :

The last but not least, I recommend Transformer models. currently they are game changers. a great Transformer based language model is Google BERT. it is highly promising on predicting future sequences. Google BERT

I also recommend, before starting your Auto Completion project using transformers, please take a look at python_autocomplete repository which uses Transformers and LSTMs to learn Python source code.


Good luck!

Upvotes: 1

Filipe Miguel Fonseca
Filipe Miguel Fonseca

Reputation: 6436

Autocomplete is usually implemented using one of the following:

  • Trees. By indexing the searchable text in a tree structure (prefix tree, suffix tree, dawg, etc..) one can execute very fast searches at the expense of memory storage. The tree traversal can be adapted for approximate matching.
  • Pattern Partitioning. By partitioning the text into tokens (ngrams) one can execute searches for pattern occurrences using a simple hashing scheme.
  • Filtering. Find a set of potential matches and then apply a sequential algorithm to check each candidate.

A couple of papers about the subject:

  • Bořivoj Melichar. Approximate String Matching by Finite Automata;
  • Gonzalo Navarro. A Guided Tour to Approximate String Matching;
  • Leonid Boytsov. Indexing Methods for Approximate Dictionary Searching: Comparative Analysis;
  • Marios Hadjieleftheriou and Divesh Srivastava. Approximate String Processing;
  • Surajit Chaudhuri and Raghav Kaushik. Extending Autocompletion To Tolerate Errors;

Take a look at completely, a Java autocomplete library.

Upvotes: 0

Moishe Lettvin
Moishe Lettvin

Reputation: 8471

Check out this blog on implementing autocomplete using GWT:

http://jroller.com/glongman/entry/gwt_autocompleter

But I would recommend you first start with something very simple on your own to grasp how the implementation is done. I'd start with a Trie, maybe even stored completely on the client, then progress to optimizing with server queries if you think they're necessary.

Upvotes: 2

Related Questions