akshat thakar

Reputation: 1526

When should we go for Apache Spark

Would it be wise to replace MapReduce (MR) completely with Spark? Here are the areas where we still use MR; I need your input on whether to go ahead with the Apache Spark option:

Upvotes: 3

Views: 807

Answers (1)

Patrick McGloin

Reputation: 2234

ETL - Spark needs much less boilerplate code than MR. Plus, you can code in Scala, Java and Python (not to mention R, but probably not for ETL). Scala especially makes ETL easy to implement; there is less code to write.
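
For illustration, a minimal ETL sketch in Scala using the core RDD API; the input path, field layout, and output path are hypothetical placeholders. The whole read-parse-filter-write pipeline fits in a few lines, versus the separate mapper, reducer and driver classes MR would need.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object EtlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("EtlSketch"))

    sc.textFile("hdfs:///data/raw/events")                         // Extract: read raw text
      .map(_.split("\t"))                                          // parse tab-separated lines
      .filter(fields => fields.length >= 3 && fields(0).nonEmpty)  // drop malformed rows
      .map(fields => s"${fields(0)},${fields(2)}")                 // Transform: keep two fields
      .saveAsTextFile("hdfs:///data/curated/events")               // Load: write the result

    sc.stop()
  }
}
```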

Machine Learning - ML is one of the reasons Spark came about. With MapReduce, each iteration is a separate job that reads from and writes to HDFS, which makes many ML programs very slow (unless you have some HDFS caching, but I don't know much about that). Spark can keep data in memory, so your programs can build ML models with different parameters, iterating repeatedly over a dataset that stays in memory, with no file system interaction except for the initial load.
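
A minimal sketch of that pattern, assuming a hypothetical CSV file of a label followed by numeric features and using the RDD-based MLlib API: the data is parsed once, cached in memory, and models are then trained with several parameter settings without touching the file system again.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object MlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MlSketch"))

    // Load and parse once, then keep the dataset in memory
    val data = sc.textFile("hdfs:///data/training.csv")
      .map(_.split(","))
      .map(f => LabeledPoint(f(0).toDouble, Vectors.dense(f.tail.map(_.toDouble))))
      .cache()

    // Train with different parameters against the cached dataset;
    // only the initial textFile read touches the file system
    for (iterations <- Seq(50, 100, 200)) {
      val model = LogisticRegressionWithSGD.train(data, iterations)
      println(s"iterations=$iterations weights=${model.weights}")
    }

    sc.stop()
  }
}
```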

NoSQL - There are many NoSQL datasources which can easily be plugged into Spark using Spark SQL. Just google the one you are interested in; it's probably very easy to connect.
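
As one example, here is a sketch of reading a Cassandra table into Spark SQL as a DataFrame. It assumes the DataStax spark-cassandra-connector is on the classpath; the keyspace, table and column names are hypothetical. Other stores (HBase, MongoDB, Elasticsearch, etc.) expose similar data source formats.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object NoSqlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("NoSqlSketch"))
    val sqlContext = new SQLContext(sc)

    // Load a Cassandra table through the connector's data source format
    val users = sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "shop", "table" -> "users"))
      .load()

    // Register it as a temporary table and query it with plain SQL
    users.registerTempTable("users")
    sqlContext.sql("SELECT country, count(*) FROM users GROUP BY country").show()

    sc.stop()
  }
}
```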

Stream Processing - Spark Streaming works in micro-batches, and one of the main selling points of Storm over Spark Streaming is that Storm does true streaming rather than micro-batches. Since you are already working in batches, Spark Streaming should be a good fit.
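
A small sketch of the micro-batch model: a word count over a socket stream with a 10-second batch interval (the host and port are placeholders). Every 10 seconds, the data received in that window is processed as one small batch job.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingSketch")
    val ssc = new StreamingContext(conf, Seconds(10)) // each micro-batch covers 10 seconds

    ssc.socketTextStream("localhost", 9999)   // receive lines of text over a socket
      .flatMap(_.split(" "))                  // split lines into words
      .map(word => (word, 1))
      .reduceByKey(_ + _)                     // count words within each batch
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```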

Hive Query - There is an ongoing Hive on Spark project; check its status here. It will allow Hive to execute queries via your Spark cluster and should be comparable to Hive on Tez.
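
In the meantime, Spark SQL's HiveContext already lets you run HiveQL against existing Hive tables from a Spark application; note this is Spark querying Hive tables, not the Hive on Spark project itself. A minimal sketch, with a hypothetical table and columns:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HiveSketch"))
    val hiveContext = new HiveContext(sc) // picks up table definitions from the Hive metastore

    // Run a HiveQL query on the Spark cluster; the result comes back as a DataFrame
    hiveContext.sql("SELECT page, count(*) AS hits FROM web_logs GROUP BY page").show()

    sc.stop()
  }
}
```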

Upvotes: 4
