Ricker Silva

Reputation: 1135

Synapse notebook script runs fine but times out and gets stuck queued in pipelines

I have a Python script in a notebook that corrects the schema of parquet files. It works fine and runs in less than 10 seconds, depending on the number of files to process. (Not exactly fine; more on this below.)

Now I need to run it in a pipeline that corrects the dataset schemas and then sends the data somewhere else. So I added a Notebook activity, linked it to the notebook I already have, and configured the pool in the same way.


Running the pipeline takes more than 30 minutes, and the Spark application appears to be stuck in the QUEUED state. I assume it would eventually time out; I didn't wait for it, since 30 minutes for a script that runs in under 10 seconds is a clear sign something is wrong.

I ran the notebook directly again, and it moves through the states just fine. However, even after the script finishes (the last print statement runs), it still shows as running on the Apache Spark applications page, and it keeps running until it is stopped by a time-out with the following error:

Error details: This application failed due to the total number of errors: 1.

Error code: 1 (LIVY_JOB_TIMED_OUT)

Message: Job failed during run time with state=[dead].

Source: Unknown

The code in the last cell is as follows:

# Usage
schema_path = f"{blob_relative_path}/person.schema.parquet"  # Example path
file_paths = [
    f"{blob_relative_path}/person.0.parquet",
    f"{blob_relative_path}/person.1.parquet",
    f"{blob_relative_path}/person.2.parquet",
    f"{blob_relative_path}/person.3.parquet"
]

print(f"reading schema template file...")
schema_df = read_parquet(schema_path)
print(f"This schema will be used as the schema template for the rest of the files")

print(f"Starting standardization")
for path in file_paths:
    df = read_parquet(path)
    print(f"file {path} loaded")
    df = standardize_schema(df, schema_df)
    print(f"file {path} standardized")
    df.info()
    write_parquet(df, path)

print(f"All files are standardized.")

I don't know what needs to be done for the job to finish when the script completes and to prevent the application from timing out, or whether this is the expected behaviour: to keep running after script completion until the application times out. Could that have something to do with the pipeline being stuck in the queued state? How can I move forward and make the notebook run correctly, both on its own and in the pipeline?

Upvotes: 0

Views: 45

Answers (1)

As you mentioned, you have two issues: the Spark application stuck in the QUEUED state, and the application timing out after the script completes.

For the Spark application stuck in the QUEUED state: the Spark pool might not have enough resources available to start the job. Check the pool's resource usage and make sure enough resources (e.g. CPU and memory) are available.

Below is the code you can use to increase executor and driver memory:

spark.conf.set("spark.executor.memory", "4g")           # memory per executor
spark.conf.set("spark.driver.memory", "4g")             # driver memory
spark.conf.set("spark.memory.offHeap.enabled", "true")  # allow off-heap allocation
spark.conf.set("spark.memory.offHeap.size", "4g")       # off-heap memory size

If you need more memory than what is available on-heap, you can use off-heap memory by setting the spark.memory.offHeap.enabled and spark.memory.offHeap.size parameters.
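
Note that in Synapse, executor and driver memory are generally fixed when the Livy session starts, so setting them with spark.conf.set inside the notebook may not take effect for the already-running session. An alternative is to request the resources up front with the %%configure magic at the start of the notebook (the values below are only illustrative), or to pick a larger node size for the Spark pool:

%%configure -f
{
    "driverMemory": "8g",
    "executorMemory": "8g",
    "numExecutors": 2
}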

Regarding the second issue, the application timing out after the script completes:

Since the notebook does not explicitly stop the Spark session, it stays active in the Spark UI until it times out. To avoid this, make sure to call spark.stop() at the end of the script. If you're using a Notebook Activity, configure it to properly handle session termination. You can use mssparkutils.session.stop() to end the session and free up resources.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # get the active session
spark.stop()                                # stop it so the application can finish
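
Since this is a Synapse notebook, you can also stop the underlying Livy session directly with mssparkutils in the final cell, for example:

from notebookutils import mssparkutils

# Stop the current session so the Spark application is released
# instead of idling until the Livy timeout.
mssparkutils.session.stop()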

Reference: How to handle Azure Databricks and Synapse session timeout issues

Upvotes: 0
