Ricker Silva

Reputation: 1135

Synapse notebook script runs fine but times out and gets stuck queued in pipelines

I have a Python script in a notebook that corrects the schema of parquet files. It works fine and runs in less than 10 seconds, depending on the number of files to process. (Not exactly fine; more on this below.)

Now I need to run it in a pipeline that corrects the dataset schemas and then sends the data somewhere else. So I added a Notebook activity, linked it to the notebook I already have, and configured the pool in the same way.


Running the pipeline takes more than 30 minutes, and the Spark application appears to be stuck in the QUEUED state. I assume it would eventually time out; I didn't wait for it, since 30 minutes for a script that runs in under 10 seconds is a clear sign something is wrong.

I ran the notebook directly again, and it moves through the states just fine. However, even after the script finishes (the last print statement runs), it still shows as running on the Apache Spark applications page, and it keeps running until it is stopped by a time-out with the following error:

Error details: This application failed due to the total number of errors: 1.

Error code: 1 (LIVY_JOB_TIMED_OUT)

Message: Job failed during run time with state=[dead].

Source: Unknown

The code in the last cell is as follows:

# Usage
schema_path = f"{blob_relative_path}/person.schema.parquet"  # Example path
file_paths = [
    f"{blob_relative_path}/person.0.parquet",
    f"{blob_relative_path}/person.1.parquet",
    f"{blob_relative_path}/person.2.parquet",
    f"{blob_relative_path}/person.3.parquet"
]

print(f"reading schema template file...")
schema_df = read_parquet(schema_path)
print(f"This schema will be used as the schema template for the rest of the files")

print(f"Starting standardization")
for path in file_paths:
    df = read_parquet(path)
    print(f"file {path} loaded")
    df = standardize_schema(df, schema_df)
    print(f"file {path} standardized")
    df.info()
    write_parquet(df, path)

print(f"All files are standardized.")

I don't know what needs to be done for the job to finish when the script completes and to prevent the application from timing out, or whether this is the expected behaviour: to keep running after script completion until the application times out. Could that have something to do with the pipeline being stuck in the queued state? How can I move forward and make the notebook run correctly, both on its own and in the pipeline?

Upvotes: 0

Views: 45

Answers (1)

As you mentioned, you have two issues: the Spark application stuck in the QUEUED state, and the application timing out after the script completes.

For the Spark application stuck in the QUEUED state: the Spark pool might not have enough resources available to start the job. Check the pool's resource usage and make sure enough resources (e.g. CPU and memory) are available.

Below is the code you can use to increase executor and driver memory:

spark.conf.set("spark.executor.memory", "4g")           # memory per executor
spark.conf.set("spark.driver.memory", "4g")             # driver memory
spark.conf.set("spark.memory.offHeap.enabled", "true")  # allow off-heap allocation
spark.conf.set("spark.memory.offHeap.size", "4g")       # off-heap memory size

If you need more memory than what is available on-heap, you can use off-heap memory by setting the spark.memory.offHeap.enabled and spark.memory.offHeap.size parameters.
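
Note that in Synapse, executor and driver memory are generally fixed when the Livy session starts, so setting them with spark.conf.set inside the notebook may not take effect for the already-running session. An alternative is to request the resources up front with the %%configure magic at the start of the notebook (the values below are only illustrative), or to pick a larger node size for the Spark pool:

%%configure -f
{
    "driverMemory": "8g",
    "executorMemory": "8g",
    "numExecutors": 2
}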

Regarding the second issue, the application timing out after the script completes:

Since the notebook does not explicitly stop the Spark session, it stays active in the Spark UI until it times out. To avoid this, make sure to call spark.stop() at the end of the script. If you're using a Notebook Activity, configure it to properly handle session termination. You can use mssparkutils.session.stop() to end the session and free up resources.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # get the active session
spark.stop()                                # stop it so the application can finish
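
Since this is a Synapse notebook, you can also stop the underlying Livy session directly with mssparkutils in the final cell, for example:

from notebookutils import mssparkutils

# Stop the current session so the Spark application is released
# instead of idling until the Livy timeout.
mssparkutils.session.stop()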

Reference: How to handle Azure Databricks and Synapse session timeout issues

Upvotes: 0
