Jresearcher
Jresearcher

Reputation: 347

Databricks scheduling a workflow with different parameters for each schedule

I'm in Databricks, where I have a python notebook that takes a file from the landing, processes it and saves a delta table.

This notebook contains a parameter (time_prm) that allows me to do this option for the different versions of files that arrive every day.

Specifically, for each file, I will have 3 daily versions and the nomenclature is the following: file_202401070900, file_202401071200, file_202401071500.

I have a workflow that runs this notebook but I need to schedule it 3 times a day and, for each build of the job, I will need to take the relative file.

For example, in the 0900 schedule I will need to take the file version file_202401070900, so I will insert 0900 as the time_prm parameter. For the file file_202401071200 I will insert 1200 etc.

Can I create multiple schedules of the same workflow and set a different time parameter for each schedule?

Do you have any other ideas to do it? It seems strange to me that I have to create 3 different workflows (overall I have many different files), so I would to generalize my work.

Thanks for your help!

Upvotes: 0

Views: 208

Answers (2)

Anupam Chand
Anupam Chand

Reputation: 2687

There is no need to create separate workflows. You can achieve this in the same workflow using a single task as shown. Set your workflow to trigger at the 3 required times i.e. 9:00, 12:00 and 15:00.

Use 4 parms with the following parameters

file_prefix : file_{{job.start_time.year}}
month : {{job.start_time.month}}
hour  : {{job.start_time.hour}}
day   : {{job.start_time.day}}

I've added these parms as task parameters but you may choose to add them as job parameters instead. enter image description here In your notebook, you need to format the month, hour and day to add a leading zero in case needed as this is not added automatically.

dbutils.widgets.text("file_prefix","")
dbutils.widgets.text("month","")
dbutils.widgets.text("hour","")
dbutils.widgets.text("day","")

file_prefix = dbutils.widgets.get("file_prefix")
month = f"{int(dbutils.widgets.get('month')):02d}"
hour = f"{int(dbutils.widgets.get('hour')):02d}"
day =  f"{int(dbutils.widgets.get('day')):02d}"
print(file_prefix)
print(month)
print(hour)
print(day)

The output of which is

01
01
08

Then you can construct your file name within the notebook as shown

Filename = file_prefix + month + day + hour + "00"
print(Filename)

Which gives the output

file_202501080100

Just keep in mind that the times are in UTC, If your file timestamp is of a different time zone, you will need to either convert UTC to that timezone to get the correct value or set the scheduling in UTC to match that of the local time on the file name.

You can now use this same code and same workflow for multiple files. All you need to do is change the scheduling to match the time on the file.

Upvotes: 0

Kashyap
Kashyap

Reputation: 17534

Two options:

  1. Single workflow/job:
  • Schedule the job to run at 9, 12 and 15 every day (e.g. 0 9-15/3 * * *).
  • No time_prm param given.
  • Write your code such that it rounds off current time to nearest-previous-3-hour-boundary and uses that as the suffix. E.g. if current time is 12:06pm then round off to 12:00 and process file_202401071200 (i.e. time_prm = 1200).

Little gotcha is automated retries or ad-hoc reprocessing etc. E.g. if 1200 job failed due to whatever reason (say file arrived late), then to support such a case (i.e. process late files), you'll probably need to support some additional params (like filename) for your job.

  1. Three workflows/jobs:
  • Schedule 3 separate jobs to run at 9, 12 and 15 every day (e.g. 0 9 * * *, 0 12 * * *, 0 15 * * *).
  • Hard code time_prm param to be 0900, 1200 and 1500 respectively.
  • Process file_20240107XXXX based on time_prm.

Late file processing or ad-hoc reprocessing would be simpler in this case so long as you're doing it on same day, compared to option 1.

I would personally go with #1, but WFYB.

Upvotes: 0

Related Questions