Reputation: 340
Apache Spark [http://spark.apache.org/] claims to be 100x faster than Apache Hadoop in memory. How does it achieve this phenomenal speedup? Is this speedup only applicable to iterative machine-learning algorithms, or also to ETL (extract-transform-load) tasks like JOINs and GROUP BYs? Can Spark's RDDs (Resilient Distributed Datasets) and DataFrames both provide this kind of speedup? Are there any benchmark results obtained by the Spark community for some of the scenarios described above?
Upvotes: 1
Views: 1215
Reputation: 38950
Spark is faster than Hadoop due to in-memory processing. But there are some caveats behind the headline numbers.
Spark still has to rely on HDFS for storage in many use cases.
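As a rough illustration of that point, here is a minimal sketch of a Spark job that reads its input from and writes its output to HDFS while doing its computation in memory; the namenode host/port and the paths are placeholders, not anything from this question:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object HdfsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hdfs-sketch"))

    // The namenode host/port and paths below are placeholders.
    val lines  = sc.textFile("hdfs://namenode:8020/data/input.txt")
    val counts = lines.flatMap(_.split("\\s+"))
                      .map((_, 1))
                      .reduceByKey(_ + _)  // computed in memory across the cluster

    counts.saveAsTextFile("hdfs://namenode:8020/data/output")
    sc.stop()
  }
}
```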
Have a look at this presentation (especially slide 6) and this benchmarking article; the complete presentation is worth reading.
Spark is faster for near-real-time analytics due to in-memory processing; Hadoop is good for batch processing. If you are not worried about job latency, you can still use Hadoop.
But one thing is sure: Spark and Hadoop have to co-exist. Neither of them can replace the other.
Upvotes: 1
Reputation: 9291
Apache Spark processes data in memory, while Hadoop MapReduce persists intermediate results back to disk after each map or reduce step. The trade-off is that Spark needs a lot of memory.
Spark loads data into memory and keeps it there until further notice, for the sake of caching.
Its core abstraction is the Resilient Distributed Dataset (RDD), which allows you to transparently store data in memory and persist it to disk only when needed.
Since intermediate results stay in memory, there is no disk-based synchronisation barrier between stages slowing you down. This is a major reason for Spark's performance.
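To make the caching point concrete, here is a minimal sketch (the input file name is a placeholder): the filtered RDD is persisted with `MEMORY_AND_DISK`, so the second action reuses the in-memory copy instead of re-reading and re-filtering the file.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CachingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("caching-sketch").setMaster("local[*]"))

    // "events.log" is a placeholder; any text file works.
    val errors = sc.textFile("events.log").filter(_.contains("ERROR"))

    // Keep the filtered RDD in memory, spilling to disk only if it does
    // not fit; later actions reuse it instead of re-reading the file.
    errors.persist(StorageLevel.MEMORY_AND_DISK)

    val total    = errors.count()             // first job: computes and caches
    val distinct = errors.distinct().count()  // second job: reads from the cache

    println(s"errors=$total distinct=$distinct")
    sc.stop()
  }
}
```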
Rather than just processing batches of stored data, as is the case with MapReduce, Spark can also manipulate data in near real time using Spark Streaming.
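For illustration, a minimal Spark Streaming (DStream) sketch that counts words arriving on a socket; the host and port are assumptions (you could feed it with `nc -lk 9999`):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

    // Assumes a text source on localhost:9999, e.g. `nc -lk 9999`.
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.print() // per-batch word counts
    ssc.start()
    ssc.awaitTermination()
  }
}
```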
The DataFrames API was inspired by data frames in R and Python (pandas), but was designed from the ground up as an extension of the existing RDD API.
A DataFrame is a distributed collection of data organized into named columns, with richer optimizations under the hood (the Catalyst query optimizer) that contribute to Spark's speed.
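A small sketch of the DataFrame API, using made-up sales data: the named columns are what lets Catalyst plan the aggregation, and `explain()` prints the optimized plan it produces.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataframe-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sales data organized into named columns.
    val sales = Seq(("books", 12.0), ("books", 7.5), ("music", 3.0))
      .toDF("category", "amount")

    val totals = sales.groupBy($"category").agg(sum($"amount").as("total"))

    totals.explain() // print the physical plan Catalyst optimized
    totals.show()
    spark.stop()
  }
}
```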
Using RDDs, Spark simplifies complex operations like join and groupBy; in the backend you are dealing with partitioned data. That partitioning is what enables Spark to execute in parallel.
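To illustrate how partitioning drives parallel joins, a sketch with toy key-value RDDs: hash-partitioning both sides the same way lets Spark compute the join within co-located partitions rather than reshuffling everything.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object JoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("join-sketch").setMaster("local[*]"))

    // Toy key-value RDDs; in practice these would come from files or tables.
    val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
    val orders = sc.parallelize(Seq((1, 9.99), (1, 4.50), (2, 7.25)))

    // Hash-partition both sides the same way so the join runs within
    // co-located partitions, in parallel, without a further shuffle.
    val part   = new HashPartitioner(4)
    val joined = users.partitionBy(part).join(orders.partitionBy(part))

    val spendPerUser = joined
      .map { case (_, (name, amount)) => (name, amount) }
      .reduceByKey(_ + _) // groupBy-style aggregation

    spendPerUser.collect().foreach(println)
    sc.stop()
  }
}
```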
Spark lets you develop complex, multi-step data pipelines using the directed acyclic graph (DAG) pattern. It supports in-memory data sharing across DAGs, so that different jobs can work with the same data. DAGs are a major part of Spark's speed.
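A minimal sketch of the DAG idea: the transformations are lazy and only build up the lineage; caching the shared node lets the two actions (two jobs) reuse the same in-memory data, and `toDebugString` prints the lineage Spark recorded.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DagSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dag-sketch").setMaster("local[*]"))

    // Transformations are lazy: these three lines only build up the DAG.
    val nums    = sc.parallelize(1 to 1000)
    val squares = nums.map(n => n * n)
    val evens   = squares.filter(_ % 2 == 0)

    evens.cache() // shared node in the DAG, reused by both jobs below

    // Each action triggers a job; intermediate data is shared in memory
    // rather than written to disk between steps.
    println(evens.count())
    println(evens.sum())

    println(evens.toDebugString) // the lineage (DAG) Spark recorded
    sc.stop()
  }
}
```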
Hope this helps.
Upvotes: 0
Reputation: 3956
The problem with most of these claims is that they are not benchmarked against true production use cases. There might be volume, but not the quality of data that represents actual business applications. Spark can be very handy for streaming analytics, where you want to understand data in near real time. But for true batch processing, Hadoop can be the better solution, especially on commodity hardware.
Upvotes: 2