Reputation: 209
I am working on a project to build a search engine that searches millions of documents. I need help with what the best existing approaches are, where to start, etc. I have already tried ElasticSearch and Apache Solr with about 10 million documents, but queries take seconds (2-4 seconds).
Upvotes: 0
Views: 2188
Reputation: 86
ElasticSearch is built on top of Lucene, and it mainly focuses on the "elasticity" of your engine. If each document is not large and the 10M documents can fit into memory, then you may consider advanced solutions such as SRCH2, which can support searches in milliseconds with many advanced features.
Upvotes: 0
Reputation: 42899
Sphinx ( http://sphinxsearch.com/ ) is another piece of software dedicated to full-text search, with a feature set close to Lucene's, except that it is a standalone server with client-side APIs and bindings for several languages.
Some high-profile websites such as Craigslist use it as a search engine with very good results, as mentioned on the website:
Craigslist.org, a free classified ads site, is rumored to fire around 250,000,000 queries/day against Sphinx. Believe it or not, this is accomplished with 15 clustered Sphinx boxes, and at peak times only consumes a 1/4 of their total capacity.
Upvotes: 1
Reputation: 11923
For millions of documents and a decently fast full-text search, you will not get around using a proper search engine built on techniques such as a term-document matrix or some other kind of inverted indexing.
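To make the idea of inverted indexing concrete, here is a minimal sketch in Python. It is only an illustration, not how Lucene, Solr, or Sphinx are implemented; the names `tokenize`, `build_inverted_index`, and `search` are made up for this example, and the tiny `docs` dictionary is sample data.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Look up the posting set of each term and intersect them.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Sample data, invented for illustration only.
docs = {
    1: "Lucene is a full-text search library",
    2: "Sphinx is a standalone search server",
    3: "An inverted index maps terms to documents",
}
index = build_inverted_index(docs)
print(sorted(search(index, "search library")))
```

The point is that a query only touches the posting sets of its own terms instead of scanning every document, which is why this structure scales to millions of documents. Real engines add compression, ranking, and on-disk storage on top of it.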
I'd suggest reading up on the basics of full-text search engines to get the most essential ideas, then looking for a good library that does what you need. (I would not suggest writing your own search engine unless you're prepared to invest some serious time.)
Recommended reading:
(Not sure you needed these pointers, if you know about these things already, good for you. ;))
=> As for actual suggestions on what to use: I had success using Apache's Lucene, a full-text search engine library for Java. It provides great help with document indexing, tokenization, word stemming, stop words, etc. It also enables you to stitch your searches together from logically combined keywords (e.g. search for 'foo' but show only docs which do not contain 'bar' or 'qux', etc.).
At the time I indexed a couple of million documents and was able to get search results in a very short time, i.e. with no noticeable delay.
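For what the 'foo' but not 'bar' or 'qux' kind of query boils down to, here is a rough sketch in Python. This is not Lucene's API (Lucene is a Java library with its own query classes); `boolean_search` and its `must` / `must_not` parameters are hypothetical names chosen for this example, and the documents are made-up sample data.

```python
import re


def tokenize(text):
    """Return the set of lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def boolean_search(docs, must, must_not=()):
    """Ids of docs that contain all `must` terms and none of `must_not`."""
    # Build a term -> set of doc ids mapping (a tiny inverted index).
    index = {}
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index.setdefault(term, set()).add(doc_id)

    # Start from all documents, then narrow down with set operations.
    hits = set(docs)
    for term in must:
        hits &= index.get(term.lower(), set())
    for term in must_not:
        hits -= index.get(term.lower(), set())
    return hits


# Sample data, invented for illustration only.
docs = {
    1: "foo and bar together",
    2: "foo on its own",
    3: "foo with qux",
    4: "nothing relevant here",
}
print(sorted(boolean_search(docs, must=["foo"], must_not=["bar", "qux"])))
```

Required terms become set intersections and excluded terms become set differences over the posting sets, so combining keywords stays cheap even when the collection is large.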
Upvotes: 2