Reputation: 3129
I am trying to run a Spark job over data from multiple Cassandra tables, which are grouped as part of the job. I am trying to get an end-to-end run over a huge data set (13M data points), and it has failed at multiple points. Each time I fix a failure and move ahead, I encounter the next problem, fix it, and restart the job again. Is there a way to speed up the testing cycle on real data so that I can restart/resume a previously failed job from a specific checkpoint?
Upvotes: 2
Views: 2658
Reputation: 1581
You can checkpoint your RDDs to disk at various midpoints, which would let you restart from there if necessary. You would have to save the intermediates as a sequence file or text file, and do a little work to make sure everything goes to and from disk cleanly.
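As a rough sketch of what that can look like (the paths, names, and record layout below are placeholders, not anything from your job):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical setup; adjust the app name and paths to your environment.
val sc = new SparkContext(new SparkConf().setAppName("resumable-job"))

// ... expensive upstream stages (reading the Cassandra tables, joining, grouping) ...
val intermediate = sc.textFile("hdfs:///input/events")
  .map(line => (line.split(",")(0), line))

// Persist the intermediate so a failure downstream doesn't force a full rerun.
intermediate.saveAsObjectFile("hdfs:///tmp/checkpoints/intermediate")

// In a restarted run, skip the stages above and reload from disk instead:
val resumed = sc.objectFile[(String, String)]("hdfs:///tmp/checkpoints/intermediate")
```

saveAsObjectFile writes a SequenceFile of serialized objects; saveAsTextFile works similarly if you would rather keep the intermediates human-readable.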
I find it more useful to start up the spark-shell and build my data flow in there. If you can identify a subset of your data which is representative, even better. Once you are in the REPL you can create RDDs, check the first value or take(100) and print them to stdout, count various result data sets, and so on. The REPL is what makes Spark 10x more productive than Hadoop for me.
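A rough sketch of that kind of REPL session (the data set path and fields are made up for illustration):

```scala
// Inside spark-shell, `sc` is already provided.
val raw = sc.textFile("hdfs:///data/events")

// Work against a representative subset to keep the feedback loop fast.
val subset = raw.sample(withReplacement = false, fraction = 0.01, seed = 42)

// Quick sanity checks while building up the flow:
subset.first()                    // inspect one record
subset.take(100).foreach(println) // eyeball a small batch on stdout
subset.count()                    // confirm the subset size is sensible
```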
Once I have built, in the REPL, a flow of transformations and actions that gets me the result I need, I can move it into a Scala file and refactor it to be clean: extract functions that can be reused and unit tested, tune the parallelism, and so on.
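A minimal sketch of what that extraction might look like (the object name and record layout are invented for illustration):

```scala
import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark versions
import org.apache.spark.rdd.RDD

// The transformation worked out in the REPL becomes a pure function over RDDs,
// which a unit test can call with a small hand-built RDD.
object EventFlow {
  def countByUser(lines: RDD[String]): RDD[(String, Long)] =
    lines
      .map(_.split(","))
      .filter(_.length >= 2)
      .map(fields => (fields(0), 1L))
      .reduceByKey(_ + _)
}
```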
I often find myself going back into the REPL when I need to extend my data flow, so I copy and paste code from my Scala file to get to a good starting point and experiment with the extension from there.
Upvotes: 5