How is Apache Spark different from the Hadoop approach?

Question

Everyone is saying that Spark is using the memory and because of that it's much faster than Hadoop.

I didn't understand from the Spark documentation what the real difference is.

Where does Spark stores the data in memory while Hadoop doesn't?
What happens if the data is too big for the memory? How similar would it be to Hadoop in that case?

Kishore · Accepted Answer

Spark tries to keep things in memory, whereas MapReduce keeps shuffling things in and out of disk. Mean intermediate output store in main memory where as hadoop store intermediate result in secondary memory. MapReduce inserts barriers, and it takes a long time to write things to disk and read them back. Hence MapReduce can be slow and laborious. The elimination of this restriction makes Spark orders of magnitude faster. For things like SQL engines such as Hive, a chain of MapReduce operations is usually needed, and this requires a lot of I/O activity. On to disk, off of disk—on to disk, off of disk. When similar operations are run on Spark, Spark can keep things in memory without I/O, so you can keep operating on the same data quickly. This results in dramatic improvements in performance, and that means Spark definitely moves us into at least the interactive category. For the record, there are some benefits to MapReduce doing all that recording to disk — as recording everything to disk allows for the possibility of restarting after failure. If you’re running a multi-hour job, you don’t want to begin again from scratch. For applications on Spark that run in the seconds or minutes, restart is obviously less of an issue.

It’s easier to develop for Spark. Spark is much more powerful and expressive in terms of how you give it instructions to crunch data. Spark has a Map and a Reduce function like MapReduce, but it adds others like Filter, Join and Group-by, so it’s easier to develop for Spark.

Spark also adds libraries for doing things like machine learning, streaming, graph programming and SQL

How is Apache Spark different from the Hadoop approach?

Answers (2)

Related Questions