Reputation: 21
I'm using Luigi for dependency resolution and it's working great.
Should I also use Luigi for "soft" dependencies?
Let me explain. Suppose my current tasks download and process data for a bunch of dates. After that, I want to run one script that goes through all the data and outputs a simple summary. Not one summary per date, but one summary for all the data that has been downloaded.
I call this a soft dependency, because I would like my final script to run on the data for all dates, but if a few dates fail to download, I would still like the script to run for the other dates.
How should I organize tasks for this use case, or is this not a job for Luigi?
Upvotes: 2
Views: 283
Reputation: 345
I wrote the luigi_soft_failures package for this exact use case. It works along the lines described by the other two replies and has some extra options like checking for soft failure of dependencies, propagating soft failures from dependencies, and automatically catching errors.
Upvotes: 0
Reputation: 16
You can do that in many ways.
Pros and cons:
The first is good for scaling, because you can do the work for those files in parallel, as long as the "checking" of each file doesn't take too long. It is a no-no to do heavy lifting inside requires(), because it delays building the whole dependency graph; if you have a big graph, this is not good.
The second is the simplest to implement, because all processing is in one task, and that makes it easier to write that one summary file.
The third is my favorite. You don't do any heavy lifting inside requires(), it scales to big graphs, and you can do a lot of processing in parallel. The one big con is that it might lead to silent bugs in those soft fails; if you can nail that down, I think you have a winner.
Upvotes: 0
Reputation: 117
You can achieve this using traditional Luigi alone. Let me explain a scenario showing how to implement it.
A few of these settings won't work on Windows, so if you are running Luigi on Linux, you can follow these steps.
You need to change the Luigi configuration, found in the luigi.cfg file.
[worker]
keep_alive = True

[scheduler]
# time (in seconds) before a failed task is retried; the default is 900 seconds
retry_delay = 10.0
# how many times the scheduler retries a task before failing it permanently
retry_count = 3
The retry_delay value is a float, in seconds. It tells the scheduler to re-queue the failed task, changing its status from failed back to pending; after 10 seconds it will be executed again. You can change this value as you please. Once you are done with these changes, restart Luigi.
Below is the code snippet, with an explanation of how to use it.
import logging
import uuid

import luigi
import luigi.mock

log = logging.getLogger('luigi-interface')
RETRY_COUNT = 3  # keep in sync with retry_count in luigi.cfg

class FileProcessor(luigi.Task):
    id = luigi.Parameter(default=str(uuid.uuid4()))  # some parameter. Add other parameters as needed here
    counter = 1  # this is important

    # Structure your output method like this if the task doesn't need to emit any
    # real output. It makes completion trackable without a local target; local
    # targets create a lot of files, which are cumbersome to manage later.
    def output(self):
        return luigi.mock.MockTarget(f'FileProcessor-{self.id}')

    def run(self):
        if self.counter == RETRY_COUNT:  # use your retry count here
            log.info(f'Number of retries for this task exceeded {RETRY_COUNT}. Hence soft-failing the task')
            self.output().open('w').close()  # tells the scheduler the task is complete even though it is not: this is the soft fail
        else:
            log.info(f'Retry count for {self.task_id} is {self.counter}')  # tracks how often the task ran before failing; useful when hitting APIs with timeouts
            FileProcessor.counter += 1  # class attribute, so the count survives retries within one worker process
            raise RuntimeError('failing so the scheduler will retry')  # the task must actually fail for a retry to happen
So, for your problem, you can use yield (dynamic dependencies) to create the tasks based on the dates. All the tasks created for the dates will complete, and then the consolidation (summary) will be executed. Even if some of them fail for one reason or another, you can still get the summary executed by using the soft fail.
Upvotes: 0