Reputation: 209

What is the best way to search millions of documents?

I am building a search engine that needs to search millions of documents. What are the best existing approaches, and what would be a good starting point? I have already tried ElasticSearch and Apache Solr with about 10 million documents, but queries take 2-4 seconds to return.

Upvotes: 0

Answers (3)

Chen Li

Reputation: 86

ElasticSearch is built on top of Lucene, and it mainly focuses on the "elasticity" of your engine. If each document is not large and the 10M documents can fit into memory, you may want to consider solutions such as SRCH2, which can answer a search in milliseconds and offers many advanced features.

Upvotes: 0

SirDarius

Reputation: 42899

Sphinx ( http://sphinxsearch.com/ ) is another piece of software dedicated to full-text search, with a feature set close to Lucene's, except that it is a standalone server with client-side APIs and bindings for several languages.

Some high-profile websites such as Craigslist use it as their search engine with very good results, as mentioned on the Sphinx website:

Craigslist.org, a free classified ads site, is rumored to fire around 250,000,000 queries/day against Sphinx. Believe it or not, this is accomplished with 15 clustered Sphinx boxes, and at peak times only consumes a 1/4 of their total capacity.

Upvotes: 1

fgysin

Reputation: 11923

For millions of documents and a decently fast full-text search, you will not get around using a proper search engine built on techniques such as a term-document matrix or some other form of inverted indexing.

I'd suggest reading up on full-text search engine basics to get the most essential ideas, then looking for a good library that does what you need. (I would not suggest writing your own search engine unless you're prepared to invest some serious time.)
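
To make the inverted-index idea concrete, here is a minimal in-memory sketch in Python. It is only an illustration of the concept, not how Lucene, Sphinx, or any other engine actually implements it. The function names (`tokenize`, `build_index`, `search`) and the sample documents are made up for this example, and real engines add ranking, compression, and on-disk storage on top of this.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Look up each term's posting set; unknown terms yield an empty set.
    postings = sorted((index.get(t, set()) for t in terms), key=len)
    # Intersect starting from the smallest set to keep the work low.
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result


# Hypothetical sample data, purely for illustration.
docs = {
    1: "Sphinx is a full-text search server",
    2: "Lucene is a full-text search library",
    3: "Craigslist runs queries against Sphinx",
}
index = build_index(docs)
```

A query never scans the documents themselves. It only looks up each term and intersects the resulting id sets, so query cost depends on the size of the matching sets rather than on the total amount of text. That is the basic reason indexed search scales better than a linear scan.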
