Reputation: 1820
I am trying to learn the Spark framework. Its homepage, https://spark.apache.org/, says that it is better than the Hadoop framework. But then it also says: Spark runs on Hadoop... I don't understand how Spark can run on Hadoop if it is supposed to be better than Hadoop.
Can someone explain the relationship between the two?
Upvotes: 2
Views: 1025
Reputation: 1463
Apache Hadoop (2.0) provides two major components: (1) HDFS, the Hadoop Distributed File System, for storing data (i.e. files) on a cluster, and (2) YARN, a cluster compute resource management system (i.e. CPUs/RAM).
Hadoop 2.0:
Hadoop (2.0) also provides an execution engine called MapReduce2 (MR2) that can use YARN compute resources to execute MapReduce-based programs.
Prior to Hadoop 2.0, YARN did not exist, and MapReduce performed both roles: resource management and execution engine. Hadoop 2.0 decoupled compute resource management from execution engines, allowing you to run many types of applications on a Hadoop cluster.
A few Hadoop 2.0 Execution Engine Examples:
YARN Resources used to run MapReduce2 (MR2)
YARN Resources used to run Spark
YARN Resources used to run Tez
Spark programs need resources to run, and they typically get them either from a standalone Spark cluster or from YARN on a Hadoop cluster; there are other ways to run Spark, but they are not necessary for this discussion.
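As a rough sketch of what this looks like in practice, the same Spark program can be pointed at either cluster manager just by changing its configuration. The host name, port, and app name below are placeholders, not values from this discussion:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: the same Spark program can get its compute resources
// from different cluster managers; the choice is just configuration.
// The host name, port, and app name are placeholders.
object MasterExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("master-example")
      // For a standalone Spark cluster (placeholder host/port):
      // .master("spark://master-host:7077")
      // For a Hadoop cluster, let YARN manage the resources
      // (requires HADOOP_CONF_DIR to point at the cluster config):
      .master("yarn")
      .getOrCreate()

    println(s"Cluster manager in use: ${spark.sparkContext.master}")
    spark.stop()
  }
}
```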
Like MapReduce programs, Spark programs can also access data stored in HDFS or in other places.
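For example, the same Spark read API works against HDFS and against a local file system; this is only a sketch, and the NameNode address and file paths below are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: the same read API works against HDFS and the local file system.
// The NameNode address and file paths are placeholders.
object ReadExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-example")
      .master("local[*]") // local mode, just for illustration
      .getOrCreate()

    // Reading from HDFS (placeholder NameNode host/port and path):
    val fromHdfs = spark.read.textFile("hdfs://namenode:8020/data/input.txt")
    // Reading the same way from a local file:
    val fromLocal = spark.read.textFile("file:///tmp/input.txt")

    println(s"HDFS lines: ${fromHdfs.count()}, local lines: ${fromLocal.count()}")
    spark.stop()
  }
}
```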
Upvotes: 6
Reputation: 9411
The main components of Hadoop are the resource manager (YARN), the distributed file system (HDFS), and the distributed processing framework (MapReduce).
Spark can run on Hadoop using YARN, but it does not use MapReduce; instead it plans jobs as a DAG (directed acyclic graph) of stages and tries to keep as much data as it can in memory rather than on the file system, though it can still read from and write to HDFS. This makes Spark faster in most scenarios.
Spark can also operate in standalone mode without a dedicated Hadoop cluster, so the two are not 100% tied together.
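To make the DAG point concrete, here is a small sketch: transformations only extend the plan, cache() asks Spark to keep an intermediate result in memory, and nothing executes until an action such as count() runs. The input path is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: transformations only extend the DAG; nothing executes until an
// action runs, and cache() keeps an intermediate result in memory.
// The input path is a placeholder.
object DagExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dag-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // encoders for flatMap/groupByKey

    val lines = spark.read.textFile("/tmp/words.txt") // placeholder path

    // These transformations are recorded lazily into the DAG:
    val wordCounts = lines
      .flatMap(_.split("\\s+"))
      .groupByKey(identity)
      .count()
      .cache() // ask Spark to keep this result in memory

    // Actions trigger execution of the whole plan:
    println(wordCounts.count()) // computes the plan and populates the cache
    wordCounts.show(10)         // served from the cached result
    spark.stop()
  }
}
```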
Upvotes: 0
Reputation: 4009
I think this will help you better understand the relationship between Spark and Hadoop:
Hadoop is essentially a distributed data infrastructure: it distributes massive data collections across multiple nodes within a cluster of commodity servers, which means you don't need to buy and maintain expensive custom hardware. It also indexes and keeps track of that data, enabling big-data processing and analytics far more effectively than was previously possible.
Spark, on the other hand, is a data-processing tool that operates on those distributed data collections; it doesn't do distributed storage.
For further information, read this.
Upvotes: 2