Jon Chase

Reputation: 563

Where does Google Dataproc store Spark logs on disk?

I'd like to get command-line access to the live logs produced by my Spark app while I'm SSH'd into the master node (the machine hosting the Spark driver program). I can see them via gcloud dataproc jobs wait, in the Dataproc web UI, and in GCS, but I'd like to access the live log from the command line so I can grep, etc. through it.

Where can I find the logs produced by Spark on the driver (and on the executors too!)?

Upvotes: 0

Views: 1682

Answers (1)

Dennis Huo

Reputation: 10677

At the moment, Dataproc doesn't tee a duplicate copy of the driver output to local disk; it only places it in GCS. This is partly because driver output doesn't quite fit into standard log-rotation policies or YARN task-log cleanup, so teeing it would require defining separate garbage-collection rules for those files on local disk, or else a longer-lived cluster would risk slowly running out of disk space.

That said, such deletion policies are certainly surmountable, so I'll go ahead and file this as a feature request to tee the driver output to both GCS and a local disk file for better ease of use.

In the meantime, though, you have a couple of options:

  1. Enable the cloud-platform scope when creating your cluster (gcloud dataproc clusters create --scopes cloud-platform); then, even from the cluster itself, you can run gcloud dataproc jobs wait <jobid> | grep foo.
  2. Alternatively, use gsutil cat. Run gcloud dataproc jobs describe from another location first to find the driverOutputResourceUri field, which points at the GCS prefix (the one you probably already found, since you mentioned seeing the logs in GCS). Because the output parts are named with a zero-padded numerical prefix, gsutil cat gs://bucket/google-cloud-dataproc-metainfo/cluster-uuid/jobs/jobid/driveroutput* will print the job output in the correct order, and you can pipe that into whatever you need. Both options are sketched after this list.
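A minimal sketch of both options, assuming a hypothetical cluster name my-cluster and job ID my-job-id; the GCS path shown is illustrative, and the --format key assumes driverOutputResourceUri sits at the top level of the job resource, so use whatever path the describe command actually reports:

    # Option 1: create the cluster with the cloud-platform scope,
    # then stream and grep the driver output via the Dataproc API.
    gcloud dataproc clusters create my-cluster --scopes cloud-platform
    gcloud dataproc jobs wait my-job-id | grep foo

    # Option 2: read the driver output directly from GCS.
    # First look up the GCS prefix that holds the driver output...
    gcloud dataproc jobs describe my-job-id \
        --format='value(driverOutputResourceUri)'
    # ...then cat the numbered output parts (they sort in order) and grep them.
    gsutil cat 'gs://<bucket>/google-cloud-dataproc-metainfo/<cluster-uuid>/jobs/my-job-id/driveroutput*' | grep foo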

Upvotes: 1
