hulk_baba

Reputation: 87

Implementing a document search engine

Background to the Problem


Hello all, I am working on a project to find relevant documents in a pile of documents, based on a provided query. As this is a mini project and I have a typical memory budget, I assume I have no more than 100 documents, each document contains no more than 1000 words, and no word has more than 10 characters. I receive many queries and have to process each one as fast as I can (definitely in under one second).

My First Approach (Naive and Non-scalable):


As users are allowed to upload documents, whenever I receive a document I look for "potential" keywords and store each keyword as a key and the document as its value, either in a map or in a MySQL table. Clearly, this keyword selection has to be done manually, which does not seem like something a programmer would do.
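(For what it's worth, even this keyword → document mapping can be kept in memory instead of a MySQL table. A minimal sketch in C++, where the documents and the looked-up keyword are made up for illustration and every word is treated as a "potential" keyword:)

    // Minimal sketch of the keyword -> document mapping using an in-memory
    // multimap instead of a MySQL table. Documents here are made up, and
    // every word is treated as a keyword for simplicity.
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>

    int main() {
        std::multimap<std::string, int> keyword_to_doc;  // keyword -> document id

        // Pretend these are two uploaded documents.
        std::string docs[] = {"the quick brown fox", "a lazy brown dog"};
        for (int id = 0; id < 2; ++id) {
            std::istringstream words(docs[id]);
            std::string w;
            while (words >> w) keyword_to_doc.insert({w, id});
        }

        // Look up every document that contains the keyword "brown".
        auto range = keyword_to_doc.equal_range("brown");
        for (auto it = range.first; it != range.second; ++it)
            std::cout << "doc " << it->second << " contains 'brown'\n";
    }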

My Second Approach (Somewhat better):


I take each document, scan every word of it, and add each word to a trie data structure, so for 100 documents I have to search 100 tries. Building the tries takes, in the worst case, O(number of words across all documents × length of the longest word), and looking up a query word in a trie takes O(length of the word). This is pretty reasonable. To implement it, I would keep a vector of trie root nodes, one per document, and search the query words in each trie. If at least half of the query words match in a document, I store that document as a potential result. I will not return more than some cut-off number of documents as the result.
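A minimal sketch of this trie-per-document idea, assuming lowercase ASCII words and in-memory documents (the document strings and the query are made up for illustration):

    // One trie per document; a document is a candidate result if at least
    // half of the query words are found in its trie.
    #include <array>
    #include <iostream>
    #include <memory>
    #include <sstream>
    #include <string>
    #include <vector>

    struct TrieNode {
        std::array<std::unique_ptr<TrieNode>, 26> child;  // lowercase ASCII only
        bool is_word = false;
    };

    void insert(TrieNode* node, const std::string& word) {
        for (char c : word) {
            auto& next = node->child[c - 'a'];
            if (!next) next = std::make_unique<TrieNode>();
            node = next.get();
        }
        node->is_word = true;
    }

    bool contains(const TrieNode* node, const std::string& word) {
        for (char c : word) {
            node = node->child[c - 'a'].get();
            if (!node) return false;
        }
        return node->is_word;
    }

    int main() {
        std::vector<std::string> docs = {"the quick brown fox", "a lazy brown dog"};
        std::vector<std::unique_ptr<TrieNode>> tries;  // one trie root per document

        // Build one trie per document.
        for (const auto& doc : docs) {
            tries.push_back(std::make_unique<TrieNode>());
            std::istringstream words(doc);
            std::string w;
            while (words >> w) insert(tries.back().get(), w);
        }

        // Query: report documents where at least half of the query words match.
        std::vector<std::string> query = {"brown", "fox"};
        for (size_t d = 0; d < docs.size(); ++d) {
            size_t hits = 0;
            for (const auto& w : query)
                if (contains(tries[d].get(), w)) ++hits;
            if (2 * hits >= query.size())
                std::cout << "doc " << d << " matches " << hits << " query words\n";
        }
    }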

My Question to the Community:


I would like to ask what you think of my approaches. How can I optimise them, and what other improvements can I make to the existing approaches? Can this be done more efficiently with other algorithms or data structures? Surfing the web, I have come across algorithms like Boyer-Moore and Aho-Corasick, and suggestions to look into the algorithms Apache Lucene implements. What do you suggest?

Upvotes: 4

Views: 3993

Answers (1)

Chen Pang

Reputation: 1388

The most basic way of implementing full-text search is to build an inverted index and rank the matching documents with a metric such as TF-IDF.

As new documents come in, you extract the words in the document and add the document to your inverted index.

When a query comes in, you find the matching documents from the index and sort them by TF-IDF (or any other metric you care about). You then return the top-k ranked documents as the result of the query.
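A minimal sketch of this inverted index plus TF-IDF ranking in C++, assuming lowercase whitespace-separated words and made-up documents, and using the plain tf = count / document length, idf = log(N / document frequency) variant (real systems typically use smoothed variants):

    // Inverted index: term -> (document id -> term count), plus document
    // lengths. Query scoring sums tf*idf over the query terms per document.
    #include <algorithm>
    #include <cmath>
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <vector>

    int main() {
        std::vector<std::string> docs = {"the quick brown fox",
                                         "a lazy brown dog",
                                         "the dog chased the fox"};

        // Indexing: as documents come in, add their words to the index.
        std::map<std::string, std::map<int, int>> index;
        std::vector<int> doc_len(docs.size(), 0);
        for (int d = 0; d < (int)docs.size(); ++d) {
            std::istringstream words(docs[d]);
            std::string w;
            while (words >> w) {
                ++index[w][d];
                ++doc_len[d];
            }
        }

        // Querying: accumulate TF-IDF per matching document.
        std::vector<std::string> query = {"brown", "fox"};
        std::map<int, double> score;
        for (const auto& term : query) {
            auto it = index.find(term);
            if (it == index.end()) continue;
            double idf = std::log((double)docs.size() / it->second.size());
            for (const auto& [d, count] : it->second)
                score[d] += ((double)count / doc_len[d]) * idf;
        }

        // Sort descending by score and print the top k documents.
        std::vector<std::pair<double, int>> ranked;
        for (const auto& [d, s] : score) ranked.push_back({s, d});
        std::sort(ranked.rbegin(), ranked.rend());
        int k = 2;
        for (int i = 0; i < k && i < (int)ranked.size(); ++i)
            std::cout << "doc " << ranked[i].second
                      << " score " << ranked[i].first << "\n";
    }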

Beyond that, there is a ton of research in the Information Retrieval field on making this operation more efficient, as well as on making the results (the top-k documents) better.

Upvotes: 5
