MacakM

Reputation: 1820

How is Spark different from Hadoop?

I am trying to learn the Spark framework. On its homepage https://spark.apache.org/ it is said to be better than the Hadoop framework. But then they say: Spark runs on Hadoop... I really don't understand how it can run on Hadoop when it is supposed to be better than Hadoop.

Can someone explain the hierarchy between those two to me?

Upvotes: 2

Views: 1025

Answers (3)

Andrew Mo

Reputation: 1463

Apache Hadoop (2.0) provides two major components: (1) HDFS, the Hadoop Distributed File System, for storing data (i.e. files) on a cluster, and (2) YARN, a system for managing the cluster's compute resources (i.e. CPUs/RAM).

Hadoop 2.0:

  • Storage Management: HDFS
  • Compute Resource Management: YARN

Hadoop (2.0) also provides an execution engine called MapReduce (MR2, MapReduce2) that can use YARN compute resources to execute MapReduce-based programs.

Prior to Hadoop (2.0), YARN did not exist, and MapReduce performed both roles of resource management and execution engine. Hadoop (2.0) decoupled compute resource management from execution engines, allowing you to run many types of applications on a Hadoop cluster.
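To make the "MapReduce execution engine" concrete, here is a minimal sketch of the map → shuffle → reduce model in plain Python. It is not Hadoop's API; the data and variable names are illustrative only, and a real job would read its records from HDFS and run mappers and reducers on many nodes.

```python
from collections import defaultdict

# Illustrative input: one "record" per line, as a job might read them from HDFS.
lines = ["spark runs on hadoop", "hadoop runs mapreduce", "spark is fast"]

# Map phase: each mapper emits (key, value) pairs -- here (word, 1).
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: the framework groups all emitted values by key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: each reducer aggregates the values for one key.
counts = {word: sum(values) for word, values in grouped.items()}

print(counts["spark"])   # 2
print(counts["hadoop"])  # 2
```

The point of the shuffle step is that it is the framework's job, not yours: on a real cluster it moves data across the network so each reducer sees every value for its keys.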

  • When people state that Spark is better than Hadoop, they are typically referring to the MapReduce execution engine.
  • When people state that Spark can run on Hadoop (2.0), they are typically referring to Spark using YARN compute resources.

A few Hadoop 2.0 Execution Engine Examples:

  • YARN Resources used to run MapReduce2 (MR2)

  • YARN Resources used to run Spark

  • YARN Resources used to run Tez

Spark programs need resources to run, and those typically come either from a Spark standalone cluster or from YARN on a Hadoop cluster. There are other ways to run Spark, but they are not necessary for this discussion.
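In practice, this choice of resource manager shows up in the `--master` flag when you submit a job with `spark-submit`. The host name and application file below are placeholders; the flags themselves are standard, but running either command assumes you have a cluster configured.

```shell
# Run an application on a Spark standalone cluster...
spark-submit --master spark://master-host:7077 my_app.py

# ...or let YARN on an existing Hadoop cluster provide the resources.
# (Requires HADOOP_CONF_DIR to point at the Hadoop cluster's config.)
spark-submit --master yarn --deploy-mode cluster my_app.py
```

The application code is the same in both cases; only where its executors get CPU and memory changes.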

Like MapReduce programs, Spark programs can also access data stored in HDFS or in other places.

Upvotes: 6

Sergey Kovalev

Reputation: 9411

The main components of Hadoop are a resource manager (YARN), a distributed file system (HDFS), and a distributed processing framework (MapReduce).

Spark can run on Hadoop using YARN, but it doesn't need HDFS or the MapReduce engine. Instead it uses a DAG (directed acyclic graph) to plan jobs and keeps as much data in memory (instead of on the file system) as it can. This makes Spark faster in most scenarios.
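The DAG idea can be illustrated in plain Python. This is not the real Spark API (the class and method names below are made up for the sketch): transformations only record a plan, and nothing executes until an "action" forces the whole chain to run, in memory.

```python
# Toy illustration of lazy, DAG-style evaluation (not the real Spark API).
class LazyDataset:
    def __init__(self, data, steps=None):
        self.data = data
        self.steps = steps or []   # the recorded plan (a linear DAG here)

    def map(self, fn):             # transformation: only extends the plan
        return LazyDataset(self.data, self.steps + [("map", fn)])

    def filter(self, fn):          # transformation: only extends the plan
        return LazyDataset(self.data, self.steps + [("filter", fn)])

    def collect(self):             # action: the whole plan executes in memory
        result = list(self.data)
        for kind, fn in self.steps:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result

pipeline = LazyDataset(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# No work has happened yet; collect() runs the recorded steps in one pass:
print(pipeline.collect())  # [0, 4, 16, 36, 64]
```

Because the full plan is known before anything runs, an engine like Spark can optimize it and keep intermediate results in memory, instead of writing them to disk between stages the way MapReduce does.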

Spark can also operate in a stand-alone mode without a dedicated Hadoop cluster, so these two are not tied together 100%.

Upvotes: 0

Gal Dreiman

Reputation: 4009

I think this will help you better understand the relation between Spark and Hadoop:

Hadoop is essentially a distributed data infrastructure: It distributes massive data collections across multiple nodes within a cluster of commodity servers, which means you don't need to buy and maintain expensive custom hardware. It also indexes and keeps track of that data, enabling big-data processing and analytics far more effectively than was possible previously.

Spark, on the other hand, is a data-processing tool that operates on those distributed data collections; it doesn't do distributed storage.

For further information read this.

Upvotes: 2
