abhishek jha

Reputation: 1095

Give custom job_id to Google Dataproc cluster for running pig/hive/spark jobs

Is there a flag for giving a custom job_id to Dataproc jobs? I am using this command to run Pig jobs:

gcloud dataproc jobs submit pig --cluster my_cluster --file my_queries.pig

I use similar commands to submit pyspark/hive jobs.

The command generates a job_id on its own, which makes the jobs difficult to track later on.

Upvotes: 0

Views: 3443

Answers (2)

Pedro Fillastre

Reputation: 922

Reading the gcloud source code, you can see that the argument called id is used as the job name:

https://github.com/google-cloud-sdk/google-cloud-sdk/blob/master/lib/googlecloudsdk/command_lib/dataproc/jobs/submitter.py#L56

Therefore, you only need to add --id to your gcloud command:

gcloud dataproc jobs submit spark --id this-is-my-job-name --cluster my-cluster --class com.myClass.Main --jars gs://my.jar
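If you want IDs that are easy to find later, one option is to derive them from a pipeline name and a date. The following is only a sketch: make_job_id is a hypothetical helper, and the character set and 100-character limit it enforces are assumptions about what Dataproc accepts for job IDs, so check them against the current documentation. The script prints the gcloud command rather than running it.

```shell
#!/usr/bin/env bash

# Hypothetical helper: build a readable job ID from a pipeline name and a date.
# Assumption: job IDs may contain only letters, digits, underscores and hyphens,
# and are limited to 100 characters.
make_job_id() {
  local name="$1" day="$2"
  # Replace every character outside the assumed allowed set with a hyphen,
  # then truncate to 100 characters.
  printf '%s-%s' "$name" "$day" | tr -c 'A-Za-z0-9_-' '-' | cut -c1-100
}

job_id="$(make_job_id "mylogspipeline" "20170508")"

# Dry run: print the command instead of executing it.
echo "gcloud dataproc jobs submit pig --id ${job_id} --cluster my_cluster --file my_queries.pig"
```

Remove the echo to actually submit the job. Note that a given ID can presumably only be used once per project and region, so including a date or a run counter keeps resubmissions from colliding.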

Upvotes: 5

Dennis Huo

Reputation: 10677

While it's possible to provide your own generated jobid when using the underlying REST API, there isn't currently any way to specify your own jobid when submitting with gcloud dataproc jobs submit; this feature might be added in the future. That said, typically when people want to specify job ids they also want to be able to list with more complex match expressions, or potentially to have multiple categories of jobs listed by different kinds of expressions at different points in time.

So, you might want to consider Dataproc labels instead; labels are intended specifically for this kind of use case and are optimized for efficient lookup. For example:

gcloud dataproc jobs submit pig --labels jobtype=mylogspipeline,date=20170508 ...
gcloud dataproc jobs submit pig --labels jobtype=mylogspipeline,date=20170509 ...
gcloud dataproc jobs submit pig --labels jobtype=mlpipeline,date=20170509 ...

gcloud dataproc jobs list --filter "labels.jobtype=mylogspipeline"
gcloud dataproc jobs list --filter "labels.date=20170509"
gcloud dataproc jobs list --filter "labels.date=20170509 AND labels.jobtype=mlpipeline"
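If you build these filters from a script, a small helper can join several label conditions with AND, as in the last command above. This is a minimal sketch: build_filter is a hypothetical function name, not part of gcloud, and the script only prints the resulting command.

```shell
#!/usr/bin/env bash

# Hypothetical helper: turn key=value label pairs into a gcloud filter
# expression of the form "labels.k1=v1 AND labels.k2=v2".
build_filter() {
  local out="" kv
  for kv in "$@"; do
    # Prepend " AND " only when the expression already has a condition.
    out="${out:+$out AND }labels.${kv}"
  done
  printf '%s' "$out"
}

filter="$(build_filter "date=20170509" "jobtype=mlpipeline")"

# Dry run: print the command instead of executing it.
echo "gcloud dataproc jobs list --filter \"${filter}\""
```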

Upvotes: 1
