user373201

Reputation: 11445

Spark save: does it collect and save, or save from each node?

We have a Spark cluster with 10 nodes. I have a process that joins a couple of dataframes and then saves the result to an S3 location. We are running in cluster mode. When I call save on the dataframe, does each node save its part of the data, or does Spark collect all the results to the driver and write them from the driver to S3? Is there a way to verify this?

Upvotes: 2

Views: 1332

Answers (1)

stevel

Reputation: 13440

A save() call triggers an evaluation of the entire query.

The work is partitioned by the source data (i.e. files, plus whatever splitting of those files can be done), and individual tasks are pushed out to the available executors. Each task writes its own share of the result; nothing is collected back to the driver. The output is then committed to the destination directory using the cross-node protocol defined in implementations of FileCommitProtocol, generally HadoopMapReduceCommitProtocol, which in turn works with Hadoop's FileOutputCommitter to choreograph the commit.
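As a minimal sketch of what this looks like from the caller's side (the session setup, bucket names and paths here are assumptions for illustration, not taken from the question):

    import org.apache.spark.sql.SparkSession

    // Minimal sketch; bucket names and paths are made up for illustration.
    val spark = SparkSession.builder().appName("join-and-save").getOrCreate()

    val a = spark.read.parquet("s3a://my-bucket/input-a")
    val b = spark.read.parquet("s3a://my-bucket/input-b")
    val joined = a.join(b, "id")

    // The write triggers the whole query; each task writes the partition it
    // computed as its own part-* file under the destination directory.
    joined.write.mode("overwrite").parquet("s3a://my-bucket/output/joined")

One rough way to verify that nothing is collected to the driver: the destination directory ends up with one part-* file per task rather than a single file, and the Spark UI shows the final write stage running as tasks on the executors, not on the driver.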

Essentially, the commit protocol works like this:

  1. Tasks write to their task-specific subdir under _temporary/$job-attempt/$task-attempt
  2. Tasks tell the driver they are ready to commit; the Spark driver tells each of them whether to commit or abort.
  3. Under speculative execution or failure conditions, tasks can abort, in which case they delete their temp dir.
  4. On task commit, the task lists the files in its dir and renames them into a job attempt dir (v1 protocol), or directly into the destination (v2 protocol).
  5. On job commit, the driver either lists and renames the files in the job attempt dir (v1 protocol), or does nothing (v2). A config sketch for choosing between the two follows this list.
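As a hedged illustration of selecting between the two algorithms (a sketch, not part of the original answer): the algorithm version is a standard Hadoop property, mapreduce.fileoutputcommitter.algorithm.version, which Spark forwards when given the spark.hadoop. prefix; it has to be set before the session is created, for example:

    import org.apache.spark.sql.SparkSession

    // v1: task commit renames into the job attempt dir, and job commit renames into
    // the destination. v2: task commit renames straight into the destination, and
    // job commit is a no-op. Set before the SparkSession is built, or pass the same
    // property with --conf on spark-submit.
    val spark = SparkSession.builder()
      .appName("commit-protocol-demo")
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
      .getOrCreate()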

On the question of "writing to S3": if you are using Apache Spark (and not Amazon EMR), then be aware that this list + rename commit mechanism is (a) slow, as renames against S3 are really copies, and (b) dangerous, as the eventual consistency of S3, especially list inconsistency, means that files saved by tasks may not be listed, and hence not committed.

At the time of writing (May 2017), the sole committer known to safely commit using the s3a or s3n clients is the Netflix committer.

There's work underway to pull this into Hadoop and hence Spark, but again, in May 2017 it's still a work in progress: demo state only. I speak as the engineer doing this.

To close, then: if you want reliable data output when writing to S3, consult whoever is hosting your code on EC2. If you are using out-of-the-box Apache Spark without any supplier-specific code, do not write direct to S3. It may work in tests, but as well as seeing intermittent failures, you may lose data and not even notice. Statistics are your enemy here: the more work you do and the bigger the datasets, the more tasks are executed, and so eventually something will go wrong.
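One conventional way of following that advice (my own sketch, not something stated in the answer; paths are made up) is to write the output to a consistent filesystem such as HDFS first, and only then copy the finished directory up to S3 in a separate step:

    // Reusing the joined DataFrame from the first sketch: commit to HDFS, where
    // renames are fast and listings are consistent.
    joined.write.mode("overwrite").parquet("hdfs:///tmp/staging/joined")

    // Then copy the completed output to S3 outside the Spark job, for example with:
    //   hadoop distcp hdfs:///tmp/staging/joined s3a://my-bucket/output/joined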

Upvotes: 1
