Reputation: 404
I run several batch jobs, and I would like to reference the Dataproc jobId in the saved output files.
That would allow me to associate all logs for arguments and output with the results. One downside remains: because executors in YARN pass away, the logs for an individual executor can no longer be obtained.
Upvotes: 1
Views: 722
Reputation: 692
This is the answer in Python if anyone is interested:
import pyspark

sc = pyspark.SparkContext()

def extract_jobid(sc):
    # Access the underlying SparkConf
    spark_conf = sc.getConf()
    # Get the value of the spark.yarn.tags configuration
    yarn_tags = spark_conf.get("spark.yarn.tags")
    # Extract the jobId from yarn_tags using string processing,
    # assuming the tag format "dataproc_job_<job_id>"
    job_id = None
    if yarn_tags:
        tags = yarn_tags.split(",")
        for tag in tags:
            if tag.startswith("dataproc_job_") and not tag.startswith("dataproc_job_attempt_timestamp_"):
                job_id = tag.split("_")[2]
                break
    return job_id

# Simply call the function to output the Dataproc jobId
extract_jobid(sc)
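As a minimal usage sketch (the bucket name is a placeholder and df is assumed to be an existing DataFrame, neither is from the original answer), the returned id can be embedded in the output path:
job_id = extract_jobid(sc)
# Write results under a folder named after the Dataproc jobId
# (placeholder bucket, assumed DataFrame "df").
output_path = "gs://my-bucket/results/{}".format(job_id)
df.write.csv(output_path, header=True)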
Upvotes: -1
Reputation: 404
The context of the Google Dataproc job is passed into Spark jobs via tags. Therefore all relevant information is present in the SparkConf and can be accessed:
pyspark.SparkConf().get("spark.yarn.application.tags", "unknown")
pyspark.SparkConf().get("spark.yarn.tags", "unknown")
The output looks like the following:
dataproc_job_3f4025a0-bce1-a254-9ddc-518a4d8b2f3d
That information can then be included in our export folder, so the output is saved with a reference to the Dataproc job:
df.select("*").write. \
format('com.databricks.spark.csv').options(header='true') \
.save(export_folder)
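As an illustration of how export_folder could be derived from the tag (a sketch only: the bucket name and folder layout are placeholders, and the prefix stripping assumes the dataproc_job_<job_id> format shown above):
# Derive the export folder from the Dataproc job tag (placeholder bucket).
tags = pyspark.SparkConf().get("spark.yarn.tags", "").split(",")
job_id = "unknown"
for tag in tags:
    if tag.startswith("dataproc_job_") and not tag.startswith("dataproc_job_attempt_timestamp_"):
        job_id = tag[len("dataproc_job_"):]
        break
export_folder = "gs://my-bucket/output/" + job_id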
Upvotes: 7