hulk_baba

Reputation: 87

Implementing a document search engine

Background to the Problem


Hello all, I am working on a project to find relevant documents in a pile of documents, based on a provided query. As this is a mini project and I have a typical memory budget, I assume I have no more than 100 documents, each document contains no more than 1000 words, and no word has more than 10 characters. I receive many queries and have to process each one as fast as I can (definitely in under one second).

My First Approach (Naive and Non-scalable):


As users are allowed to upload documents, whenever I receive a document I look for "potential" keywords and store each keyword as a key and the document as its value, either in a map or in a MySQL table. Clearly, this keyword selection has to be done manually, which does not seem like something a programmer would do.
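(For what it's worth, even this keyword → document mapping can be kept in memory instead of a MySQL table. A minimal sketch in C++, where the documents and the looked-up keyword are made up for illustration and every word is treated as a "potential" keyword:)

    // Minimal sketch of the keyword -> document mapping using an in-memory
    // multimap instead of a MySQL table. Documents here are made up, and
    // every word is treated as a keyword for simplicity.
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>

    int main() {
        std::multimap<std::string, int> keyword_to_doc;  // keyword -> document id

        // Pretend these are two uploaded documents.
        std::string docs[] = {"the quick brown fox", "a lazy brown dog"};
        for (int id = 0; id < 2; ++id) {
            std::istringstream words(docs[id]);
            std::string w;
            while (words >> w) keyword_to_doc.insert({w, id});
        }

        // Look up every document that contains the keyword "brown".
        auto range = keyword_to_doc.equal_range("brown");
        for (auto it = range.first; it != range.second; ++it)
            std::cout << "doc " << it->second << " contains 'brown'\n";
    }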

My Second Approach (Somewhat better):


I take each document, scan every word of it, and add each word to a trie data structure, so for 100 documents I have to search 100 tries. Building the tries takes, in the worst case, O(number of words across all documents × length of the longest word), and looking up a query word in a trie takes O(length of the word). This is pretty reasonable. To implement it, I would keep a vector of trie root nodes, one per document, and search the query words in each trie. If at least half of the query words match in a document, I store that document as a potential result. I will not return more than some cut-off number of documents as the result.
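A minimal sketch of this trie-per-document idea, assuming lowercase ASCII words and in-memory documents (the document strings and the query are made up for illustration):

    // One trie per document; a document is a candidate result if at least
    // half of the query words are found in its trie.
    #include <array>
    #include <iostream>
    #include <memory>
    #include <sstream>
    #include <string>
    #include <vector>

    struct TrieNode {
        std::array<std::unique_ptr<TrieNode>, 26> child;  // lowercase ASCII only
        bool is_word = false;
    };

    void insert(TrieNode* node, const std::string& word) {
        for (char c : word) {
            auto& next = node->child[c - 'a'];
            if (!next) next = std::make_unique<TrieNode>();
            node = next.get();
        }
        node->is_word = true;
    }

    bool contains(const TrieNode* node, const std::string& word) {
        for (char c : word) {
            node = node->child[c - 'a'].get();
            if (!node) return false;
        }
        return node->is_word;
    }

    int main() {
        std::vector<std::string> docs = {"the quick brown fox", "a lazy brown dog"};
        std::vector<std::unique_ptr<TrieNode>> tries;  // one trie root per document

        // Build one trie per document.
        for (const auto& doc : docs) {
            tries.push_back(std::make_unique<TrieNode>());
            std::istringstream words(doc);
            std::string w;
            while (words >> w) insert(tries.back().get(), w);
        }

        // Query: report documents where at least half of the query words match.
        std::vector<std::string> query = {"brown", "fox"};
        for (size_t d = 0; d < docs.size(); ++d) {
            size_t hits = 0;
            for (const auto& w : query)
                if (contains(tries[d].get(), w)) ++hits;
            if (2 * hits >= query.size())
                std::cout << "doc " << d << " matches " << hits << " query words\n";
        }
    }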

My Question to the Community:


I would like to ask what you think of my approaches. How can I optimise them, and what other improvements can I make to the existing approaches? Can this be done more efficiently with other algorithms or data structures? Surfing the web, I have come across algorithms like Boyer-Moore and Aho-Corasick, and suggestions to look into the algorithms Apache Lucene implements. What do you suggest?

Upvotes: 4

Views: 3993

Answers (1)

Chen Pang

Reputation: 1388

The most basic way of implementing full-text search is to build an inverted index and rank the matching documents with a metric such as TF-IDF.

As new documents come in, you extract the words in the document and add the document to your inverted index.

When a query comes in, you find the matching documents from the index and sort them by TF-IDF (or any other metric you care about). You then return the top-k ranked documents as the result of the query.
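A minimal sketch of this inverted index plus TF-IDF ranking in C++, assuming lowercase whitespace-separated words and made-up documents, and using the plain tf = count / document length, idf = log(N / document frequency) variant (real systems typically use smoothed variants):

    // Inverted index: term -> (document id -> term count), plus document
    // lengths. Query scoring sums tf*idf over the query terms per document.
    #include <algorithm>
    #include <cmath>
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <vector>

    int main() {
        std::vector<std::string> docs = {"the quick brown fox",
                                         "a lazy brown dog",
                                         "the dog chased the fox"};

        // Indexing: as documents come in, add their words to the index.
        std::map<std::string, std::map<int, int>> index;
        std::vector<int> doc_len(docs.size(), 0);
        for (int d = 0; d < (int)docs.size(); ++d) {
            std::istringstream words(docs[d]);
            std::string w;
            while (words >> w) {
                ++index[w][d];
                ++doc_len[d];
            }
        }

        // Querying: accumulate TF-IDF per matching document.
        std::vector<std::string> query = {"brown", "fox"};
        std::map<int, double> score;
        for (const auto& term : query) {
            auto it = index.find(term);
            if (it == index.end()) continue;
            double idf = std::log((double)docs.size() / it->second.size());
            for (const auto& [d, count] : it->second)
                score[d] += ((double)count / doc_len[d]) * idf;
        }

        // Sort descending by score and print the top k documents.
        std::vector<std::pair<double, int>> ranked;
        for (const auto& [d, s] : score) ranked.push_back({s, d});
        std::sort(ranked.rbegin(), ranked.rend());
        int k = 2;
        for (int i = 0; i < k && i < (int)ranked.size(); ++i)
            std::cout << "doc " << ranked[i].second
                      << " score " << ranked[i].first << "\n";
    }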

Beyond that, there is a ton of research in the Information Retrieval field on making this operation more efficient, as well as on making the results (the top-k documents) better.

Upvotes: 5
