Reputation: 41
Oozie version 4.2.0 supports a Spark action that runs as a Spark job. Is it possible to share an RDD between actions? For example, one action reads a file, performs some transformations to create an RDD, say rdd1, and saves it to HDFS (Spark action); can another Oozie action then take rdd1 and perform further transformations and actions on it?
The above is possible with a single Spark driver class, but I am looking for an Oozie solution, since a single driver class would become very complicated for a complex workflow.
Thanks in advance for your answer.
Regards, Gouranga Basak
Upvotes: 2
Views: 408
Reputation: 2747
One solution could be to use spark-jobserver, which lets multiple jobs share the same SparkContext.
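For example, one job can register an RDD under a name and a later job submitted to the same long-lived context can fetch it. This is only a rough sketch against the old ooyala/spark-jobserver 0.x API (SparkJob plus NamedRddSupport); the object names, config key, and the "rdd1" key are made up:

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

// First job: build rdd1 and register it under a name in the shared context
object ProducerJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    val rdd1 = sc.textFile(config.getString("input.path")).map(_.toUpperCase)
    this.namedRdds.update("rdd1", rdd1) // keep it available for later jobs
    rdd1.count()
  }
}

// Second job: fetch rdd1 by name instead of re-reading it from HDFS
object ConsumerJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    val rdd1 = this.namedRdds.get[String]("rdd1").get
    rdd1.filter(_.startsWith("A")).count()
  }
}

Both jobs have to be submitted to the same persistent context created through the jobserver REST API; that is what keeps the RDD alive between calls.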
Another solution could be to use Tachyon to do basically what you described above: store the intermediate result in Tachyon, which keeps it in memory until the next job uses it.
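Since Tachyon exposes a Hadoop-compatible filesystem, the hand-off can be as simple as writing to a tachyon:// URI in one driver and reading it in the next. In this sketch the master host, port, and paths are placeholders, and the Tachyon client jar is assumed to be on Spark's classpath:

// In job 1: persist the intermediate RDD to Tachyon (held in memory by Tachyon)
rdd1.saveAsObjectFile("tachyon://tachyon-master:19998/intermediate/rdd1")

// In job 2: reload it in a fresh SparkContext
val rdd1 = sc.objectFile[MyClass]("tachyon://tachyon-master:19998/intermediate/rdd1")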
However, the best way to do this is most likely to refactor your pipeline so that it can be executed within the same context, or simply accept the performance hit. You can save an RDD to HDFS and reload it again using:
// In job 1
rdd.saveAsObjectFile("path")
// In job 2
val rdd = sc.objectFile[MyClass]("path")
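Put together for the Oozie case, that means two self-contained driver classes, one per Spark action, with HDFS as the hand-off point. A minimal sketch, where the class names, paths, and the MyClass record are illustrative:

import org.apache.spark.{SparkConf, SparkContext}

case class MyClass(word: String, count: Int)

// Driver launched by the first Oozie Spark action
object FirstAction {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("first-action"))
    val rdd1 = sc.textFile("hdfs:///data/input")
      .flatMap(_.split("\\s+"))
      .map(w => (w, 1))
      .reduceByKey(_ + _)
      .map { case (w, c) => MyClass(w, c) }
    // Persist the intermediate RDD so the next action can pick it up
    rdd1.saveAsObjectFile("hdfs:///data/intermediate/rdd1")
    sc.stop()
  }
}

// Driver launched by the second Oozie Spark action
object SecondAction {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("second-action"))
    // Reload the RDD written by FirstAction and continue the pipeline
    val rdd1 = sc.objectFile[MyClass]("hdfs:///data/intermediate/rdd1")
    rdd1.filter(_.count > 10).saveAsObjectFile("hdfs:///data/output")
    sc.stop()
  }
}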
Upvotes: 1