Reputation: 53916
In the past, for jobs that required a heavy processing load, I would use Scala and parallel collections.
I'm currently experimenting with Spark and find it interesting, but the learning curve is steep. I also find development slower, because I have to work with the reduced Scala API that Spark exposes.
What do I need to determine before deciding whether or not to use Spark?
The current Spark job I'm trying to implement processes approximately 5GB of data. That data isn't huge, but I'm running a Cartesian product over it, which generates well over 50GB of intermediate data. Maybe using Scala parallel collections would be just as fast; I do know the development time would be shorter for me that way.
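For concreteness, here is a rough sketch of the two approaches I'm weighing. `Record` and `score()` are placeholders for my real types and pairwise logic, and the output path is purely illustrative:

```scala
import org.apache.spark.SparkContext

object CartesianSketch {
  // Placeholders for the real record type and pairwise computation.
  case class Record(id: Long, payload: String)
  def score(a: Record, b: Record): Double = ???

  // Parallel collections: all ~50GB of pairs live in a single JVM heap.
  def withParCollections(data: Seq[Record]): Seq[(Long, Long, Double)] =
    (for {
      a <- data.par // .par is built in up to Scala 2.12; a separate module from 2.13 on
      b <- data.par
    } yield (a.id, b.id, score(a, b))).seq

  // Spark: cartesian() partitions the pairs across executors and can spill to disk.
  def withSpark(sc: SparkContext, data: Seq[Record]): Unit = {
    val rdd = sc.parallelize(data)
    rdd.cartesian(rdd)
      .map { case (a, b) => (a.id, b.id, score(a, b)) }
      .saveAsTextFile("pairs-output") // illustrative path
  }
}
```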
So what considerations should I take into account before deciding to use Spark?
Upvotes: 2
Views: 220
Reputation: 3388
The main advantages Spark has over traditional high-performance computing frameworks (e.g. MPI) are fault tolerance, easy integration into the Hadoop stack, and a remarkably active mailing list (http://mail-archives.apache.org/mod_mbox/spark-user/). Getting distributed fault-tolerant in-memory computations to work efficiently isn't easy, and it's definitely not something I'd want to implement myself. There's a review of other approaches to the problem in the original paper: https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
However, when my work is I/O bound, I still tend to rely primarily on Pig scripts, as Pig is more mature and I find the scripts easier to write. Spark has been great when Pig scripts won't cut it (e.g. iterative algorithms, graphs, lots of joins).
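To make the "iterative algorithms" point concrete, here's a hedged sketch. The update rule is made up; the point is that `cache()` keeps the working set in memory, so every pass after the first skips the re-read from disk, which a chain of Pig/MapReduce jobs won't give you:

```scala
import org.apache.spark.SparkContext

object IterativeSketch {
  def iterate(sc: SparkContext, path: String, steps: Int): Double = {
    // Materialized once, then reused by every iteration.
    val values = sc.textFile(path).map(_.toDouble).cache()
    val n = values.count()
    var estimate = 0.0
    for (_ <- 1 to steps)
      estimate = values.map(v => (v + estimate) / 2).reduce(_ + _) / n // placeholder update
    estimate
  }
}
```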
Now, if you've only got 50GB of data, you probably don't care about distributed fault-tolerant computation (if everything is on a single node, there's no framework in the world that can save you from a node failure :)), so parallel collections will work just fine.
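If you go that route, here's a minimal sketch of what I mean: keep only the ~5GB input on the heap and reduce over the Cartesian pairs lazily instead of materializing all ~50GB of them. `Record` and `score()` are the same placeholders as in the question, and the max-score reduction is just one example of a fold you might run:

```scala
object LocalSketch {
  case class Record(id: Long, payload: String)
  def score(a: Record, b: Record): Double = ???

  def bestPairScore(data: Vector[Record]): Double =
    data.par.map { a =>                       // outer loop runs across all cores
      data.iterator.map(b => score(a, b)).max // inner loop stays lazy; pairs are never stored
    }.max
}
```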
Upvotes: 2