Reputation: 347
I'm in Databricks, where I have a python notebook that takes a file from the landing, processes it and saves a delta table.
This notebook contains a parameter (time_prm) that allows me to do this option for the different versions of files that arrive every day.
Specifically, for each file, I will have 3 daily versions and the nomenclature is the following: file_202401070900, file_202401071200, file_202401071500.
I have a workflow that runs this notebook but I need to schedule it 3 times a day and, for each build of the job, I will need to take the relative file.
For example, in the 0900 schedule I will need to take the file version file_202401070900, so I will insert 0900 as the time_prm parameter. For the file file_202401071200 I will insert 1200 etc.
Can I create multiple schedules of the same workflow and set a different time parameter for each schedule?
Do you have any other ideas to do it? It seems strange to me that I have to create 3 different workflows (overall I have many different files), so I would to generalize my work.
Thanks for your help!
Upvotes: 0
Views: 208
Reputation: 2687
There is no need to create separate workflows. You can achieve this in the same workflow using a single task as shown. Set your workflow to trigger at the 3 required times i.e. 9:00, 12:00 and 15:00.
Use 4 parms with the following parameters
file_prefix : file_{{job.start_time.year}}
month : {{job.start_time.month}}
hour : {{job.start_time.hour}}
day : {{job.start_time.day}}
I've added these parms as task parameters but you may choose to add them as job parameters instead.
In your notebook, you need to format the month, hour and day to add a leading zero in case needed as this is not added automatically.
dbutils.widgets.text("file_prefix","")
dbutils.widgets.text("month","")
dbutils.widgets.text("hour","")
dbutils.widgets.text("day","")
file_prefix = dbutils.widgets.get("file_prefix")
month = f"{int(dbutils.widgets.get('month')):02d}"
hour = f"{int(dbutils.widgets.get('hour')):02d}"
day = f"{int(dbutils.widgets.get('day')):02d}"
print(file_prefix)
print(month)
print(hour)
print(day)
The output of which is
01
01
08
Then you can construct your file name within the notebook as shown
Filename = file_prefix + month + day + hour + "00"
print(Filename)
Which gives the output
file_202501080100
Just keep in mind that the times are in UTC, If your file timestamp is of a different time zone, you will need to either convert UTC to that timezone to get the correct value or set the scheduling in UTC to match that of the local time on the file name.
You can now use this same code and same workflow for multiple files. All you need to do is change the scheduling to match the time on the file.
Upvotes: 0
Reputation: 17534
Two options:
0 9-15/3 * * *
).time_prm
param given.12:06pm
then round off to 12:00
and process file_202401071200
(i.e. time_prm
= 1200
).Little gotcha is automated retries or ad-hoc reprocessing etc. E.g. if 1200 job failed due to whatever reason (say file arrived late), then to support such a case (i.e. process late files), you'll probably need to support some additional params (like filename) for your job.
0 9 * * *
, 0 12 * * *
, 0 15 * * *
).time_prm
param to be 0900
, 1200
and 1500
respectively.file_20240107XXXX
based on time_prm
.Late file processing or ad-hoc reprocessing would be simpler in this case so long as you're doing it on same day, compared to option 1.
I would personally go with #1, but WFYB.
Upvotes: 0