Korean_Of_the_Mountain

Reputation: 1577

Cloud Function GCS trigger is prematurely responding to partial file upload

Full workflow:

  1. SFTP Mirror uploads new files from SFTP to GCS Bucket
  2. New GCS Objects trigger Cloud Function
  3. Cloud Function triggers a Composer/Airflow DAG and sends it the path of new GCS object

Looking at the DAG run history in the Composer/Airflow UI, there's a pattern where a task failure is immediately followed by a task success.

The purpose of the task is to upload a file to BQ. The path to the file is provided by the Cloud Function.

There is a clear pattern: the logs of the failed task show that it tried to process a file with a name like my_timestamped_file_name.csv.part

The logs of the subsequent task that succeeds show that the file it processed had the same name without the .part suffix: my_timestamped_file_name.csv

It seems to me that the Cloud Function (CF) is being triggered by the partially uploaded file created by the SFTP mirror instead of waiting for the upload to finish. Of course, once the file is completely uploaded, the .part file disappears and the task fails because it has nothing to process.

My Cloud Function's Event Type is defined as Finalize/Create. Is there a way to avoid partially uploaded files? Other than using a hacky conditional statement inside the CF to avoid files that end with .part?

Upvotes: 1

Views: 642

Answers (1)

SANN3

Reputation: 10069

The rule being created is that whenever a file is created it should trigger the GCF, so the GCF is doing its job correctly. Possible solutions are:

  1. Filtering the .part file in the GCF
  2. If possible, pass the temp directory as a different folder to the SFTP Mirror
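Option 1 can be sketched as an early return in the function itself. A minimal sketch, assuming a Python background Cloud Function with a GCS finalize trigger; the function name `gcs_trigger` and the DAG-triggering step are placeholders:

```python
def gcs_trigger(event, context):
    """Background Cloud Function triggered by a GCS object finalize event."""
    object_name = event["name"]  # e.g. "my_timestamped_file_name.csv.part"

    # The SFTP mirror writes in-progress uploads with a ".part" suffix,
    # then creates the final object; skip the temporary one.
    if object_name.endswith(".part"):
        print(f"Skipping partial upload: {object_name}")
        return None

    # ... trigger the Composer/Airflow DAG with the finalized object path ...
    print(f"Triggering DAG for: {object_name}")
    return object_name
```

Note that the finalize event still fires for the `.part` object; the filter only makes the function a no-op for it, so each upload costs one extra (short) invocation.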

Upvotes: 1
