Garfield

Reputation: 436

MapReduce vs Spark vs Storm vs Drill - for small files

I know Spark does in-memory computation and is much faster than MapReduce. I was wondering how well Spark works for small inputs, say fewer than 10,000 records. I have a huge number of files (each with around 10,000 records and about 100 columns) coming into my Hadoop data platform, and I need to perform some data quality checks on them before I load them into HBase.

I do the data quality checks in Hive, which uses MapReduce at the back end. Each file takes about 8 minutes, and that's pretty bad for me. Will Spark give me better performance, say 2-3 minutes?

I know I need to do benchmarking, but I wanted to understand the basics before I really get going with Spark. As I recall, creating an RDD for the first time carries some overhead, and since I have to create a new RDD for each incoming file, that is going to cost me a bit.
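For context, this is roughly the kind of per-file check I have in mind in Spark. The path, delimiter, and the rule on the first field are just placeholders, not my actual checks:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FileQualityCheck {
  def main(args: Array[String]): Unit = {
    // Placeholder HDFS path; one RDD would be created per incoming file.
    val path = if (args.nonEmpty) args(0) else "hdfs:///landing/incoming_file.csv"

    val conf = new SparkConf().setAppName("file-quality-check")
    val sc = new SparkContext(conf)

    // Each file is ~10,000 records with ~100 delimited columns.
    val lines = sc.textFile(path)

    val expectedColumns = 100
    // Count rows that fail simple rules: wrong column count or empty key field.
    // The real quality checks would replace these placeholders.
    val badRows = lines
      .map(_.split(",", -1))
      .filter(fields => fields.length != expectedColumns || fields(0).trim.isEmpty)
      .count()

    println(s"rows failing quality check: $badRows")

    sc.stop()
  }
}
```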

I am confused about which would be the best approach for me: Spark, Drill, Storm, or MapReduce itself?

Upvotes: 1

Views: 659

Answers (1)

Dev

Reputation: 13753

I have just been exploring the performance of Drill vs Spark vs Hive over millions of records. Drill and Spark were both around 5-10 times faster in my case (I did not run any performance tests on a cluster with significant RAM; I only tested on a single node). The reason for the fast computation is that both of them perform in-memory computation.

The performance of Drill and Spark was almost comparable in my case, so I can't say which one is better. You need to try this at your end.

Testing with Drill will not take much time: download the latest Drill, install it on your MapR Hadoop cluster, add the Hive storage plugin, and run the query.
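If it helps, here is a rough sketch of querying a Hive table through Drill's JDBC driver from Scala. The ZooKeeper hosts, database, and table name are placeholders, and you would need the drill-jdbc-all jar on the classpath:

```scala
import java.sql.DriverManager

object DrillHiveQuery {
  def main(args: Array[String]): Unit = {
    // Placeholder ZooKeeper quorum; adjust to your cluster's actual hosts.
    val url = "jdbc:drill:zk=zk1:2181,zk2:2181,zk3:2181"

    Class.forName("org.apache.drill.jdbc.Driver")
    val conn = DriverManager.getConnection(url)
    try {
      val stmt = conn.createStatement()
      // "hive" is the storage plugin name; database and table are placeholders.
      val rs = stmt.executeQuery("SELECT COUNT(*) FROM hive.`default`.my_table")
      while (rs.next()) {
        println(s"row count: ${rs.getLong(1)}")
      }
      rs.close()
      stmt.close()
    } finally {
      conn.close()
    }
  }
}
```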

Upvotes: 1
