Reputation: 20264
It looks like there are two ways to use Spark as the backend engine for Hive.

The first is to use Spark directly as the engine, like this tutorial. The other is to use Spark as the backend engine for MapReduce, like this tutorial.

In the first tutorial, `hive.execution.engine` is `spark`, and I cannot see `hdfs` involved. In the second tutorial, `hive.execution.engine` is still `mr`, but as there is no `hadoop` process, it looks like the backend of `mr` is Spark.

Honestly, I'm a little bit confused about this. I guess the first one is recommended, as `mr` has been deprecated. But where is `hdfs` involved?
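For reference, the engine switch the two tutorials make comes down to a single property; a minimal sketch of what each setting looks like in a Hive session (or in `hive-site.xml`):

```sql
-- In a Hive CLI / Beeline session:
SET hive.execution.engine=spark;  -- first tutorial: Spark as the engine
SET hive.execution.engine=mr;     -- second tutorial: the deprecated MapReduce engine
```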
Upvotes: 0
Views: 630
Reputation: 117
Apache Spark builds a DAG (directed acyclic graph), whereas MapReduce sticks to its native map and reduce phases. During execution in Spark, the logical dependencies become physical dependencies.
Now what is a DAG?
A DAG captures the logical dependencies between operations before execution. (Think of it as a visual graph.)
When we have multiple map and reduce phases, or the output of one reduce is the input to another map, the DAG helps speed up the jobs.
The DAG is built in Tez (right side of the photo) but not in MapReduce (left side).
NOTE: Apache Spark works on a DAG but has stages in place of map/reduce. Tez has a DAG and works on map/reduce. To keep it simple I used the map/reduce terminology, but remember that Apache Spark has stages. The concept of the DAG remains the same.
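To make the DAG idea concrete, here is a minimal sketch in plain Python (the `Dataset` class and its methods are made up for illustration; this is not Spark's API): transformations only record logical dependencies, and nothing runs until `collect()` turns the logical plan into physical execution.

```python
# Illustrative lazy-evaluation sketch, not real Spark code.
class Dataset:
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []  # the logical plan: a chain of recorded transformations

    def map(self, f):
        # Record the transformation; do not execute it yet.
        return Dataset(self.data, self.ops + [("map", f)])

    def filter(self, f):
        return Dataset(self.data, self.ops + [("filter", f)])

    def collect(self):
        # Only now do the logical dependencies become physical execution.
        out = self.data
        for kind, f in self.ops:
            if kind == "map":
                out = [f(x) for x in out]
            else:
                out = [x for x in out if f(x)]
        return out

ds = Dataset([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(ds.collect())  # [20, 30, 40]
```

Because the whole plan is known before anything runs, an engine like Spark can fuse consecutive steps into one stage instead of materializing each intermediate result.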
Reason 2: Map persists its output to disk (it buffers in memory too, but once the buffer is about 90% full, the output spills to disk), and from there the data goes to the merge phase. In Apache Spark, intermediate data is persisted to memory, which makes it faster. Check this link for details.
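The disk-versus-memory difference can be sketched in plain Python as well (both functions are illustrative stand-ins, not real MapReduce or Spark code): the first writes the map output to disk and re-reads it before reducing; the second keeps the intermediate result in memory between stages.

```python
import os
import tempfile

def mapreduce_style(nums):
    # MapReduce-style: the map output is persisted to disk, then re-read for reduce.
    with tempfile.NamedTemporaryFile("w", delete=False, suffix=".txt") as f:
        for n in nums:                         # "map" phase
            f.write(f"{n * n}\n")
        path = f.name
    with open(path) as f:                      # merge/reduce re-reads from disk
        total = sum(int(line) for line in f)
    os.unlink(path)
    return total

def spark_style(nums):
    # Spark-style: the intermediate data stays in memory between stages.
    squared = (n * n for n in nums)            # stage 1, never touches disk
    return sum(squared)                        # stage 2

print(mapreduce_style([1, 2, 3]), spark_style([1, 2, 3]))  # 14 14
```

Same answer either way; skipping the round trip through the filesystem is what makes the in-memory version faster.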
Upvotes: 1
Reputation: 18108
I understood it differently.
Normally Hive uses MR as the execution engine, unless you use Impala, but not all distros have that.
For a while now, though, Spark can also be used as the execution engine for Hive.
https://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/ discusses this in more detail.
Upvotes: 1