Reputation: 81
I have a requirement to monitor an S3 bucket for (zip) files being placed in it. As soon as a file lands in the S3 bucket, the pipeline should start processing it. Currently I have a Workflow job with multiple tasks that perform the processing. In the job parameters I have configured the S3 file path and am able to trigger the pipeline manually, but I need to automate the monitoring through Auto Loader. I have set up Databricks Auto Loader in another notebook and managed to get the list of files arriving at the S3 path by querying the checkpoint:
checkpoint_query = f"SELECT * FROM cloud_files_state('{checkpoint_path}') ORDER BY create_time DESC LIMIT 1"
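For reference, the stream in that notebook is along these lines (a simplified sketch, not the exact code; the bucket path and target table name are placeholders, and I read the zips with the binaryFile format since Auto Loader does not unpack archives):

# Minimal Auto Loader stream (placeholder bucket path and table name).
stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "binaryFile")  # zip files arrive as opaque binaries
    .load("s3://my-bucket/incoming/")
)

(
    stream.writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)  # process whatever has arrived, then stop
    .toTable("incoming_zip_files")
)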
I want to integrate this notebook with my job, but I have no clue how to hook it into the pipeline job. Some pointers would be much appreciated.
Upvotes: 1
Views: 1181
Reputation: 91
You need to create a workflow job with the pipeline as the upstream task and your notebook as the downstream task. Currently there is no way to run custom notebooks within a DLT pipeline.
Check this for how to create a workflow: https://docs.databricks.com/workflows/jobs/jobs.html#job-create
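If you prefer to create the job programmatically instead of through the UI, here is a sketch against the Jobs API 2.1 (the workspace URL, token, pipeline ID, and notebook path are placeholders you would replace with your own values):

import requests

# Placeholders -- substitute your workspace URL, PAT, pipeline ID, and notebook path.
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

job_spec = {
    "name": "s3-zip-processing",
    "tasks": [
        {
            # Upstream task: the DLT pipeline running the Auto Loader ingest.
            "task_key": "autoloader_ingest",
            "pipeline_task": {"pipeline_id": "<dlt-pipeline-id>"},
        },
        {
            # Downstream task: the processing notebook, which runs only
            # after the pipeline task succeeds.
            "task_key": "process_files",
            "depends_on": [{"task_key": "autoloader_ingest"}],
            "notebook_task": {"notebook_path": "/Users/<you>/process_files"},
        },
    ],
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())  # e.g. {"job_id": 123456}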
Upvotes: 0