Mikel Urkia

Reputation: 2095

Saving files from Spark in a distributed way

According to the Spark documentation,

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.

I am currently working on a large dataset that, once processed, produces an even bigger amount of data, which needs to be stored in text files, as is done with the saveAsTextFile(path) method.

So far I have been using this method; however, since it is an action (as stated above) and not a transformation, Spark needs to send data from every partition to the driver node, which slows down the saving process quite a bit.
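For reference, a minimal sketch of my current pipeline; the paths and the flatMap step are hypothetical placeholders, not my actual processing:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SaveExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("save-example"))

    // Hypothetical input path and placeholder processing step.
    val input = sc.textFile("hdfs:///data/input")
    val processed = input.flatMap(_.split("\\s+"))

    // The action that writes the result out as text files.
    processed.saveAsTextFile("hdfs:///data/output")

    sc.stop()
  }
}
```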

I was wondering if any distributed file-saving method (similar to saveAsTextFile()) exists in Spark, enabling each executor to store its own partition by itself.

Upvotes: 5

Views: 2344

Answers (1)

kuujo

Reputation: 8195

I think you're misinterpreting what it means to send a result to the driver. saveAsTextFile does not send the data back to the driver. Rather, it sends the result of the save back to the driver once it's complete. That is, saveAsTextFile is distributed. The only case where it's not distributed is if you only have a single partition, or you've coalesced your RDD back to a single partition before calling saveAsTextFile.
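To illustrate, a quick sketch assuming an existing RDD[String] called rdd and hypothetical HDFS output paths:

```scala
// Distributed: each executor writes its own partition in parallel,
// producing one part-NNNNN file per partition in the output directory.
rdd.saveAsTextFile("hdfs:///data/out-distributed")

// Not distributed: coalescing to a single partition funnels all the
// data through one task, so a single executor performs the whole write.
rdd.coalesce(1).saveAsTextFile("hdfs:///data/out-single")
```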

What that documentation is referring to is sending the result of saveAsTextFile (or any other "action") back to the driver. If you call collect() then it will indeed send the data to the driver, but saveAsTextFile only sends a success/failure message back to the driver once complete. The save itself is still done on many nodes in the cluster, which is why you'll end up with many files - one per partition.
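The contrast, again assuming an RDD[String] named rdd and hypothetical paths:

```scala
// collect() really does ship every record to the driver; on a large
// RDD this can exhaust the driver's memory.
val allRecords: Array[String] = rdd.collect()

// saveAsTextFile() writes on the executors; the driver only receives
// the outcome. The output directory then holds one file per partition:
//   hdfs:///data/output/part-00000
//   hdfs:///data/output/part-00001
//   ...
rdd.saveAsTextFile("hdfs:///data/output")
```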

IO is always expensive. But sometimes it can seem as if saveAsTextFile is even more expensive, precisely because of the lazy behavior described in that excerpt. Essentially, when saveAsTextFile is called, Spark may have to perform many or all of the prior transformations on the way to producing the saved output. That is what is meant by laziness.
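A sketch of how that laziness shows up, assuming a SparkContext sc and hypothetical paths:

```scala
// Nothing executes here: textFile, flatMap and filter only record
// the lineage of transformations.
val words     = sc.textFile("hdfs:///data/input").flatMap(_.split("\\s+"))
val longWords = words.filter(_.length > 10)

// The action triggers the whole pipeline: the read, the split and the
// filter all run now, which is why the save itself can appear slow.
longWords.saveAsTextFile("hdfs:///data/words")
```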

If you haven't already done so, setting up the Spark UI may give you better insight into what is happening to the data on its way to a save.

Upvotes: 8
