Reputation: 563
I'd like to get command-line access to the live logs produced by my Spark app when I'm SSH'd into the master node (the machine hosting the Spark driver program). I'm able to see them using gcloud dataproc jobs wait, the Dataproc web UI, and in GCS, but I'd like to be able to access the live log via the command line so I can grep, etc. through it.
Where can I find the logs produced by Spark on the driver (and on the executors too!)?
Upvotes: 0
Views: 1682
Reputation: 10677
At the moment, Dataproc doesn't actually tee out any duplicate copy of the driver output to local disk; it just places it in GCS. That's partly because driver output doesn't quite fit into standard log-rotation policies or YARN task log cleanup, so teeing it locally would require extra definitions of how to garbage-collect those output files on local disk, or a longer-lived cluster would risk slowly running out of disk space.
That said, such deletion policies are certainly surmountable, so I'll go ahead and add this as a feature request to tee the driver output out to both GCS and a local disk file for better ease-of-use.
In the meantime though, you have a couple options:
1. Enable the cloud-platform scope when creating your cluster (gcloud dataproc clusters create --scopes cloud-platform); then, even on the cluster itself, you can run gcloud dataproc jobs wait <jobid> | grep foo.
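A minimal sketch of that flow; the cluster name, job ID, and grep pattern are placeholders rather than values from this post, and you may also need --region/--project flags depending on your gcloud configuration:

    # Create the cluster with the broader cloud-platform scope so that gcloud
    # API calls such as "jobs wait" are authorized from the master node itself.
    gcloud dataproc clusters create my-cluster --scopes cloud-platform

    # Later, while SSH'd into the master, stream the live driver output for a
    # job and grep through it as it arrives.
    gcloud dataproc jobs wait my-job-id | grep ERROR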
2. Use gsutil cat. If you can run gcloud dataproc jobs describe from another location first to find the driverOutputResourceUri field, it points at the GCS prefix (which you probably already found, since you mentioned finding the logs in GCS). Since the output parts are named with a padded numerical prefix, gsutil cat gs://bucket/google-cloud-dataproc-metainfo/cluster-uuid/jobs/jobid/driveroutput* will print the job output in the correct order, and you can then pipe it into whatever you need.
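A sketch of that approach, assuming the standard --format='value(...)' projection to pull out the field; the bucket, cluster UUID, and job ID are placeholders, and the real prefix is whatever driverOutputResourceUri reports for your job (again, add --region/--project as your configuration requires):

    # From a machine with credentials, find where the driver output is stored.
    gcloud dataproc jobs describe my-job-id --format='value(driverOutputResourceUri)'
    # It prints a GCS prefix along the lines of:
    #   gs://my-bucket/google-cloud-dataproc-metainfo/<cluster-uuid>/jobs/my-job-id/driveroutput

    # Concatenate the padded, numbered output parts in order and grep them.
    gsutil cat 'gs://my-bucket/google-cloud-dataproc-metainfo/<cluster-uuid>/jobs/my-job-id/driveroutput*' | grep ERROR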
Upvotes: 1