Mikel Urkia

Reputation: 2095

Saving files from Spark in a distributed way

According to the Spark documentation,

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.

I am currently working on a large dataset that, once processed, produces an even bigger amount of data, which needs to be stored in text files, as is done with the saveAsTextFile(path) method.

So far I have been using this method; however, since it is an action (as stated above) and not a transformation, Spark needs to send data from every partition to the driver node, which slows down the saving process quite a bit.
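For reference, a minimal sketch of my current pipeline; the paths and the flatMap step are hypothetical placeholders, not my actual processing:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SaveExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("save-example"))

    // Hypothetical input path and placeholder processing step.
    val input = sc.textFile("hdfs:///data/input")
    val processed = input.flatMap(_.split("\\s+"))

    // The action that writes the result out as text files.
    processed.saveAsTextFile("hdfs:///data/output")

    sc.stop()
  }
}
```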

I was wondering if any distributed file-saving method (similar to saveAsTextFile()) exists in Spark, enabling each executor to store its own partition by itself.

Upvotes: 5

Views: 2344

Answers (1)

kuujo

Reputation: 8195

I think you're misinterpreting what it means to send a result to the driver. saveAsTextFile does not send the data back to the driver. Rather, it sends the result of the save back to the driver once it's complete. That is, saveAsTextFile is distributed. The only case where it's not distributed is if you only have a single partition, or you've coalesced your RDD back to a single partition before calling saveAsTextFile.
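To illustrate, a quick sketch assuming an existing RDD[String] called rdd and hypothetical HDFS output paths:

```scala
// Distributed: each executor writes its own partition in parallel,
// producing one part-NNNNN file per partition in the output directory.
rdd.saveAsTextFile("hdfs:///data/out-distributed")

// Not distributed: coalescing to a single partition funnels all the
// data through one task, so a single executor performs the whole write.
rdd.coalesce(1).saveAsTextFile("hdfs:///data/out-single")
```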

What that documentation is referring to is sending the result of saveAsTextFile (or any other "action") back to the driver. If you call collect() then it will indeed send the data to the driver, but saveAsTextFile only sends a success/failure message back to the driver once complete. The save itself is still done on many nodes in the cluster, which is why you'll end up with many files - one per partition.
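The contrast, again assuming an RDD[String] named rdd and hypothetical paths:

```scala
// collect() really does ship every record to the driver; on a large
// RDD this can exhaust the driver's memory.
val allRecords: Array[String] = rdd.collect()

// saveAsTextFile() writes on the executors; the driver only receives
// the outcome. The output directory then holds one file per partition:
//   hdfs:///data/output/part-00000
//   hdfs:///data/output/part-00001
//   ...
rdd.saveAsTextFile("hdfs:///data/output")
```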

IO is always expensive. But sometimes it can seem as if saveAsTextFile is even more expensive, precisely because of the lazy behavior described in that excerpt. Essentially, when saveAsTextFile is called, Spark may have to perform many or all of the prior transformations on the way to producing the saved output. That is what is meant by laziness.
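A sketch of how that laziness shows up, assuming a SparkContext sc and hypothetical paths:

```scala
// Nothing executes here: textFile, flatMap and filter only record
// the lineage of transformations.
val words     = sc.textFile("hdfs:///data/input").flatMap(_.split("\\s+"))
val longWords = words.filter(_.length > 10)

// The action triggers the whole pipeline: the read, the split and the
// filter all run now, which is why the save itself can appear slow.
longWords.saveAsTextFile("hdfs:///data/words")
```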

If you haven't already done so, setting up the Spark UI may give you better insight into what is happening to the data on its way to a save.

Upvotes: 8
