Reputation: 2399
I need some good references for using Hadoop in real-time systems, such as search with low response times. I know Hadoop carries the overhead of HDFS, but what's the best way of doing this with Hadoop?
Upvotes: 3
Views: 3153
Reputation: 11
This question is old, but it still deserves an answer. If you have millions of documents that are not changing in real time, like FAQ docs, Lucene plus Solr for distribution should pretty much cover the need. Hathi Trust indexes billions of documents using that same combination.
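If the static-index route fits, the serving layer can live entirely outside Hadoop and simply query Solr over HTTP at request time. Here is a minimal SolrJ sketch of such a lookup, assuming a core named "faq" with "id", "title" and "body" fields (all placeholder names, not anything from the question):

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrLookup {
    public static void main(String[] args) throws Exception {
        // Point at one Solr node (or a load balancer in front of the replicas).
        // URL, core name and field names are placeholders for illustration.
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/faq").build()) {
            SolrQuery query = new SolrQuery("body:\"password reset\"");
            query.setRows(10);                      // top 10 hits only
            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("id") + " -> "
                        + doc.getFieldValue("title"));
            }
        }
    }
}
```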
It is a completely different problem if the index is changing in real time. Even Lucene has trouble keeping up with index updates, and you have to look at real-time search engines instead. There have been attempts at reworking Lucene for real-time use, and they may work. You can also look at HSearch, a real-time distributed search engine built on Hadoop and HBase, hosted at http://bizosyshsearch.sourceforge.net
Upvotes: 1
Reputation: 19676
You need to provide a lot more information about the goals and challenges of your system to get good advice. Perhaps Hadoop is not what you need, and you just require some distributed systems foo? (Oh and are you totally sure you require a distributed system? There's an awful lot you can do with a replicated database on top of a couple of large-memory machines).
Knowing nothing about your problem, I'll give you a few shot-in-the-dark attempts at answering.
If you really do need to do serious processing at query time, the way to do it is to run dedicated processes that perform the specific kinds of computation you need, and use something like Thrift to send requests for computation and receive results back. Optimize them so all the needed data is held in memory. The process that receives the query then does nothing more than break the problem into pieces, send the pieces to the compute nodes, and collect the results. This sounds like Hadoop, but it isn't, because it's built for specific computations over pre-loaded data rather than being a generic computation model for arbitrary jobs.
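For illustration only, here is a minimal sketch of that scatter-gather shape in plain Java: an in-process thread pool and fake in-memory shards stand in for the dedicated compute processes and the Thrift RPCs between them, and every class and method name is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical scatter-gather front end: it only splits the query across
// shards, fans it out in parallel, and merges the partial results. In a real
// deployment each shard would be a separate process holding its slice of the
// data in memory, reached via a Thrift RPC instead of a local method call.
public class ScatterGatherSearch {

    // Stand-in for a remote, memory-resident compute node.
    interface SearchShard {
        List<String> search(String query);
    }

    private final List<SearchShard> shards;
    private final ExecutorService pool;

    ScatterGatherSearch(List<SearchShard> shards) {
        this.shards = shards;
        this.pool = Executors.newFixedThreadPool(shards.size());
    }

    List<String> search(String query) throws Exception {
        // Scatter: one task per shard, all running in parallel.
        List<Future<List<String>>> partials = new ArrayList<>();
        for (SearchShard shard : shards) {
            Callable<List<String>> task = () -> shard.search(query);
            partials.add(pool.submit(task));
        }
        // Gather: merge the partial results (a real engine would rank them).
        List<String> merged = new ArrayList<>();
        for (Future<List<String>> partial : partials) {
            merged.addAll(partial.get());
        }
        return merged;
    }

    void shutdown() {
        pool.shutdown();
    }

    public static void main(String[] args) throws Exception {
        // Two fake in-memory shards so the flow runs end to end.
        SearchShard a = q -> List.of("shard-a hit for: " + q);
        SearchShard b = q -> List.of("shard-b hit for: " + q);
        ScatterGatherSearch frontEnd = new ScatterGatherSearch(List.of(a, b));
        System.out.println(frontEnd.search("low-latency search"));
        frontEnd.shutdown();
    }
}
```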
Upvotes: 10
Reputation: 186118
Hadoop is completely the wrong tool for this kind of requirement. It is explicitly optimised for large batch jobs that run from several minutes up to hours or even days.
FWIW, HDFS is not where the overhead comes from. It's the fact that Hadoop jobs deploy a jar file onto every node, set up a working area, start each job running, pass information via files between stages of the computation, communicate progress and status to the job runner, and so on.
Upvotes: 5