Frank

Reputation: 404

How to retrieve Dataproc's jobId within a PySpark job

I run several batch jobs and I would like to reference the Dataproc jobId in the saved output files.

That would allow all logs for arguments and output to be associated with the results. One downside remains: as executors in YARN go away, no logs for an individual executor can be obtained anymore.

Upvotes: 1

Views: 722

Answers (2)

Shirin Yavari

Reputation: 692

This is the answer in Python if anyone is interested:

import pyspark

# Reuse the existing SparkContext if the job already created one
sc = pyspark.SparkContext.getOrCreate()

def extract_jobid(sc):
    # Access the underlying SparkConf
    spark_conf = sc.getConf()

    # Get the value of the spark.yarn.tags configuration
    yarn_tags = spark_conf.get("spark.yarn.tags")

    # Extract the jobId from yarn_tags using string processing,
    # assuming the tag format "dataproc_job_<job_id>"
    job_id = None
    if yarn_tags:
        tags = yarn_tags.split(",")
        for tag in tags:
            if tag.startswith("dataproc_job_") and not tag.startswith("dataproc_job_attempt_timestamp_"):
                # Strip the prefix so job ids containing underscores stay intact
                job_id = tag[len("dataproc_job_"):]
                break
    return job_id

# Simply call the function to obtain the Dataproc jobId
print(extract_jobid(sc))

Upvotes: -1

Frank

Reputation: 404

The Google Dataproc context is passed into Spark jobs via YARN tags. Therefore, all relevant information is present in the SparkConf and can be accessed:

pyspark.SparkConf().get("spark.yarn.application.tags", "unknown")
pyspark.SparkConf().get("spark.yarn.tags", "unknown")

The output looks like the following:

dataproc_job_3f4025a0-bce1-a254-9ddc-518a4d8b2f3d

That information can then be included in our export folder path, so the output is saved with the Dataproc reference:

df.select("*").write \
    .format('com.databricks.spark.csv') \
    .options(header='true') \
    .save(export_folder)
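
To tie the two answers together, here is a minimal self-contained sketch (not part of the original answer) that pulls the job id out of the tag and folds it into the export path before writing. The bucket name, the sample DataFrame and the use of Spark's built-in CSV writer are assumptions for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Tag set by Dataproc, e.g. "dataproc_job_3f4025a0-bce1-a254-9ddc-518a4d8b2f3d"
yarn_tags = sc.getConf().get("spark.yarn.tags", "")

# Fall back to "unknown" when running outside of Dataproc
job_id = "unknown"
for tag in yarn_tags.split(","):
    if tag.startswith("dataproc_job_") and not tag.startswith("dataproc_job_attempt_timestamp_"):
        job_id = tag[len("dataproc_job_"):]
        break

# Hypothetical base path; the job id becomes part of the output folder,
# so the saved files carry the Dataproc reference
export_folder = "gs://my-bucket/exports/{}".format(job_id)

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.options(header="true").csv(export_folder)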

Upvotes: 7
