akshat thakar

Reputation: 1526

When should we go for Apache Spark

Would it be wise to replace MapReduce (MR) completely with Spark? Here are the areas where we still use MR; I need your input on whether to go ahead with the Apache Spark option:

Upvotes: 3

Views: 807

Answers (1)

Patrick McGloin

Reputation: 2234

ETL - Spark needs much less boilerplate code than MR. Plus, you can code in Scala, Java and Python (not to mention R, but probably not for ETL). Scala especially makes ETL easy to implement; there is less code to write.
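
For illustration, a minimal ETL sketch in Scala using the core RDD API; the input path, field layout, and output path are hypothetical placeholders. The whole read-parse-filter-write pipeline fits in a few lines, versus the separate mapper, reducer and driver classes MR would need.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object EtlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("EtlSketch"))

    sc.textFile("hdfs:///data/raw/events")                         // Extract: read raw text
      .map(_.split("\t"))                                          // parse tab-separated lines
      .filter(fields => fields.length >= 3 && fields(0).nonEmpty)  // drop malformed rows
      .map(fields => s"${fields(0)},${fields(2)}")                 // Transform: keep two fields
      .saveAsTextFile("hdfs:///data/curated/events")               // Load: write the result

    sc.stop()
  }
}
```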

Machine Learning - ML is one of the reasons Spark came about. With MapReduce, each iteration is a separate job that reads from and writes to HDFS, which makes many ML programs very slow (unless you have some HDFS caching, but I don't know much about that). Spark can keep data in memory, so your programs can build ML models with different parameters, iterating repeatedly over a dataset that stays in memory, with no file system interaction except for the initial load.
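
A minimal sketch of that pattern, assuming a hypothetical CSV file of a label followed by numeric features and using the RDD-based MLlib API: the data is parsed once, cached in memory, and models are then trained with several parameter settings without touching the file system again.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object MlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MlSketch"))

    // Load and parse once, then keep the dataset in memory
    val data = sc.textFile("hdfs:///data/training.csv")
      .map(_.split(","))
      .map(f => LabeledPoint(f(0).toDouble, Vectors.dense(f.tail.map(_.toDouble))))
      .cache()

    // Train with different parameters against the cached dataset;
    // only the initial textFile read touches the file system
    for (iterations <- Seq(50, 100, 200)) {
      val model = LogisticRegressionWithSGD.train(data, iterations)
      println(s"iterations=$iterations weights=${model.weights}")
    }

    sc.stop()
  }
}
```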

NoSQL - There are many NoSQL datasources which can easily be plugged into Spark using Spark SQL. Just google the one you are interested in; it's probably very easy to connect.
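
As one example, here is a sketch of reading a Cassandra table into Spark SQL as a DataFrame. It assumes the DataStax spark-cassandra-connector is on the classpath; the keyspace, table and column names are hypothetical. Other stores (HBase, MongoDB, Elasticsearch, etc.) expose similar data source formats.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object NoSqlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("NoSqlSketch"))
    val sqlContext = new SQLContext(sc)

    // Load a Cassandra table through the connector's data source format
    val users = sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "shop", "table" -> "users"))
      .load()

    // Register it as a temporary table and query it with plain SQL
    users.registerTempTable("users")
    sqlContext.sql("SELECT country, count(*) FROM users GROUP BY country").show()

    sc.stop()
  }
}
```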

Stream Processing - Spark Streaming works in micro-batches, and one of the main selling points of Storm over Spark Streaming is that Storm does true streaming rather than micro-batches. Since you are already working in batches, Spark Streaming should be a good fit.
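
A small sketch of the micro-batch model: a word count over a socket stream with a 10-second batch interval (the host and port are placeholders). Every 10 seconds, the data received in that window is processed as one small batch job.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingSketch")
    val ssc = new StreamingContext(conf, Seconds(10)) // each micro-batch covers 10 seconds

    ssc.socketTextStream("localhost", 9999)   // receive lines of text over a socket
      .flatMap(_.split(" "))                  // split lines into words
      .map(word => (word, 1))
      .reduceByKey(_ + _)                     // count words within each batch
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```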

Hive Query - There is an ongoing Hive on Spark project; check its status here. It will allow Hive to execute queries via your Spark cluster and should be comparable to Hive on Tez.
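
In the meantime, Spark SQL's HiveContext already lets you run HiveQL against existing Hive tables from a Spark application; note this is Spark querying Hive tables, not the Hive on Spark project itself. A minimal sketch, with a hypothetical table and columns:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HiveSketch"))
    val hiveContext = new HiveContext(sc) // picks up table definitions from the Hive metastore

    // Run a HiveQL query on the Spark cluster; the result comes back as a DataFrame
    hiveContext.sql("SELECT page, count(*) AS hits FROM web_logs GROUP BY page").show()

    sc.stop()
  }
}
```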

Upvotes: 4
