Reputation: 589
I have a MapReduce task (https://github.com/flopezluis/testing-hadoop) that reads the files in a folder and appends them to a zip. I need to run this task forever, so when it finishes processing the files, it should run again. I'm reading about Oozie, but I'm not sure whether it's the best fit, because it may be too heavyweight for my problem.
In case Oozie is the best solution: if I write a coordinator that runs every 10 minutes, what happens if the task takes more than 10 minutes? Does the coordinator wait before running the task again?
Explanation of the task
The folder is always the same. There are different zip files, one per key. The idea is to build each zip file step by step; I think this is faster than creating the zip file after all the files have been processed. The files contain something like this:
<info operationId="key1">
DATA1
</info>
<info operationId="key1">
DATA2
</info>
<info operationId="key2">
DATA3
</info>
So the zips will be like this:
key1.zip --> data1, data2
key2.zip --> data3
Thanks
Upvotes: 1
Views: 1173
Reputation: 411
If all you need is to execute the same Hadoop job repeatedly on different input files, Oozie might be overkill. Installing and configuring Oozie on your testbed will also take some time. Writing a script that submits the Hadoop job repeatedly might be enough.
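For illustration only, a minimal sketch of such a script; the jar name, main class and HDFS paths are made-up placeholders for whatever your build actually produces:

#!/bin/sh
# Re-submit the MapReduce job as soon as the previous run finishes.
while true; do
  # If your job writes to a standard FileOutputFormat output dir,
  # you may need to clear it between runs.
  hadoop fs -rm -r -f /output/zips
  hadoop jar testing-hadoop.jar com.example.ZipJob /input/folder /output/zips
  sleep 60   # optional pause between runs
done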
That said, Oozie can do this. If you set the concurrency to 1, there will be at most one Oozie coordinator action (which in your case would be a workflow containing a single Hadoop job) in RUNNING status at any time. You can raise the concurrency threshold to allow more actions to execute concurrently.
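As a rough illustration (not from the original answer), a coordinator definition with concurrency pinned to 1 could look something like this; the app name, dates and HDFS path are made up:

<coordinator-app name="zip-files-coord" frequency="${coord:minutes(10)}"
                 start="2012-06-01T00:00Z" end="2020-01-01T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
  <controls>
    <!-- at most one coordinator action in RUNNING status at a time -->
    <concurrency>1</concurrency>
  </controls>
  <action>
    <workflow>
      <!-- workflow that wraps your single map-reduce action -->
      <app-path>hdfs://namenode/user/hadoop/apps/zip-workflow</app-path>
    </workflow>
  </action>
</coordinator-app>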
Upvotes: 1
Reputation: 302
You can use Oozie for this. Oozie has a setting that limits how many instances of a job can be running at once. If your first job isn't finished after ten minutes, then it will wait to run the next job.
From the Oozie documentation:
6.1.6. Coordinator Action Execution Policies
The execution policies for the actions of a coordinator job can be defined in the coordinator application.
• Timeout: A coordinator job can specify the timeout for its coordinator actions, this is, how long the coordinator action will be in WAITING or READY status before giving up on its execution.
• Concurrency: A coordinator job can specify the concurrency for its coordinator actions, this is, how many coordinator actions are allowed to run concurrently (RUNNING status) before the coordinator engine starts throttling them.
• Execution strategy: A coordinator job can specify the execution strategy of its coordinator actions when there is backlog of coordinator actions in the coordinator engine. The different execution strategies are 'oldest first', 'newest first' and 'last one only'. A backlog normally happens because of delayed input data, concurrency control or because manual re-runs of coordinator jobs.
Just wanted to also comment that you could have the coordinator job triggered off data arrival with a DataSet, but I am not that familiar with DataSets.
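For what it's worth, here is a rough sketch of how the quoted execution policies and a data-arrival trigger might fit together in a coordinator definition; treat the URIs, dates and names as illustrative assumptions rather than a tested configuration:

<coordinator-app name="zip-files-coord" frequency="${coord:minutes(10)}"
                 start="2012-06-01T00:00Z" end="2020-01-01T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
  <controls>
    <timeout>30</timeout>          <!-- minutes an action may sit in WAITING/READY -->
    <concurrency>1</concurrency>   <!-- one action in RUNNING status at a time -->
    <execution>FIFO</execution>    <!-- 'oldest first' when a backlog builds up -->
  </controls>
  <datasets>
    <dataset name="incoming" frequency="${coord:minutes(10)}"
             initial-instance="2012-06-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://namenode/data/incoming/${YEAR}${MONTH}${DAY}${HOUR}${MINUTE}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="incoming">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://namenode/user/hadoop/apps/zip-workflow</app-path>
    </workflow>
  </action>
</coordinator-app>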
Upvotes: 3