Rizki Sunaryo

Reputation: 107

Why Hadoop or Spark? There is ElasticSearch

Actually, there is similar question here: https://stackoverflow.com/questions/23922404/elasticsearch-hadoop-why-would-i

But the answer doesn't really satisfy me.

My questions are simple:

  1. Why should we use Hadoop or Spark, when ElasticSearch exists?
  2. What does Hadoop or Spark have that ElasticSearch doesn't?
  3. If algorithms are the answer: I don't believe I'm better than Kimchy at creating algorithms, yet with Hadoop or Spark we have to write our own. Again, why still Hadoop or Spark?
  4. The answer said, "Elasticsearch is a distributed search engine and it shouldn't be used as a data warehouse."

Why shouldn't it be used as a data warehouse?

Thank you and best regards,

Rizki Sunaryo

Upvotes: 7

Views: 10670

Answers (2)

Baha

Reputation: 408

I was asking myself the same question, and I think this almost answers our question for now:

Elasticsearch has begun to expand beyond being just a search engine and has added some features for analytics and visualization, but at its core it remains primarily a full-text search engine and provides less support for complex calculation and aggregation as part of a query.

So it depends on your use case (mostly text search and analysis -> ELK; heavy aggregations and calculations -> Spark), although the line is blurry (see the sketch after the reference below):

Elasticsearch and Apache Hadoop/Spark may overlap on some very useful functionality, but each tool serves a specific purpose and we need to choose what best suits the given requirement. If we simply want to locate documents by keyword and perform simple analytics, then ElasticSearch may fit the job. If we have a huge quantity of data that needs a wide variety of complex processing and analysis, then Hadoop provides the broadest range of tools and the most flexibility. The good thing is that we are not limited to using only one tool or technology at a time; we can always combine them based on what we need the outcome to be. For example, Hadoop and Elasticsearch are known to work well when combined. In the future, these boundaries will blur further given the speed at which these technologies are expanding.

Reference:

https://thecustomizewindows.com/2017/02/apache-hadoop-spark-vs-elasticsearch-elk-stack/
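
For instance, here is a minimal Spark sketch in Scala (the dataset, the column names userId/amount/country, and the threshold are all made up for illustration) of the kind of join-plus-group-by-plus-filter calculation that is natural in Spark but awkward to express as a single Elasticsearch query:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object AggregationSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("aggregation-sketch")
          .master("local[*]")   // local run, for illustration only
          .getOrCreate()
        import spark.implicits._

        // Hypothetical inputs; in practice these would be read from HDFS, S3, etc.
        val users  = Seq((1, "ID"), (2, "SG")).toDF("userId", "country")
        val orders = Seq((1, 120.0), (1, 80.0), (2, 200.0)).toDF("userId", "amount")

        // Join, group, aggregate and filter in one pipeline -- the kind of
        // multi-step calculation that goes beyond a full-text search query.
        val spendPerCountry = orders
          .join(users, "userId")
          .groupBy($"country")
          .agg(sum($"amount").as("totalSpend"), avg($"amount").as("avgSpend"))
          .filter($"totalSpend" > 100)

        spendPerCountry.show()
        spark.stop()
      }
    }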

Upvotes: 3

Matt Fortier

Reputation: 1223

I am very far from being an expert in distributed computing, but am I missing something here or are you comparing two completely different things?

Hadoop is a distributed batch-computing platform that lets you run data extraction and transformation pipelines. ES is a search and analytics engine (or data aggregation platform) that lets you, say, index the result of your Hadoop job for search purposes.

So a complete pipeline would be something like:

Data --> Hadoop/Spark (MapReduce or Other Paradigm) --> Curated Data --> ElasticSearch/Lucene/SOLR/etc.

You may be in situations where you just want to extract and/or transform data and have no use for Elasticsearch. You may also be in situations where your data source does not require, or does not play well with, the distributed batch-processing paradigm, in which case Hadoop is of no use to you.

Where you may be confused is that ES offers elasticsearch-hadoop, which plugs directly into Hadoop/Spark to offer you an "all-in-one" solution, so to speak.
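
For what it's worth, a minimal sketch of that combination (assuming the elasticsearch-spark connector, e.g. the org.elasticsearch:elasticsearch-spark-30 artifact, is on the classpath, an ES node reachable on localhost:9200, and a made-up index name "curated-docs") looks roughly like this:

    import org.apache.spark.sql.SparkSession
    import org.elasticsearch.spark.sql._   // adds saveToEs(...) to DataFrames

    object IndexCuratedData {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("index-curated-data")
          .config("es.nodes", "localhost")   // where Elasticsearch is reachable
          .config("es.port", "9200")
          .getOrCreate()
        import spark.implicits._

        // Stand-in for the "Curated Data" step of the pipeline above.
        val curated = Seq(
          (1, "hadoop spark pipeline"),
          (2, "elasticsearch indexing")
        ).toDF("id", "text")

        // Push the curated records into Elasticsearch for search/analytics.
        curated.saveToEs("curated-docs")

        spark.stop()
      }
    }

So the connector is just the glue between the batch side (Hadoop/Spark) and the search side (ES) of the pipeline sketched above; it doesn't turn one into the other.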

Hopefully someone far more knowledgeable than me can also chip in on this.

Upvotes: 13
