Reputation: 153
I have a requirement to run multiple MapReduce jobs based on different sets of files that hit the same table. I was exploring Oozie, but I am not completely familiar with it.
My requirements are:
1. Run jobs based on a time trigger and/or file availability.
2. If certain files are not available, the corresponding step should be skipped.
3. The user should be able to configure which steps run and what priority each step has.
Can anyone tell me whether Oozie fits these requirements? If so, how can I accomplish this?
If not, is there any free or commercial tool similar to Visual Cron (which we intend to replace) that can run MapReduce and Java-based jobs?
Upvotes: 2
Views: 593
Reputation: 9067
Quoting "Oozie Coord Use Cases" (from guys who actually used Oozie before it was Open Source - and are still the biggest users by far)
Here are some typical use cases for the Oozie Coordinator Engine.
- You want to run your workflow once a day at 2PM (similar to a CRON).
- You want to run your workflow every hour and you also want to wait for specific data feeds to be available on HDFS.
- You want to run a workflow that depends on other workflows.
The page continues with a tutorial.
And, by the way, the latest release of Oozie is V4.2; see its Coordinator documentation.
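To make the second use case concrete, here is a minimal coordinator sketch (the name, dates, and dataset path are illustrative assumptions, using the uri:oozie:coordinator:0.4 schema) that runs a workflow daily at 2 PM only once that day's data feed has landed on HDFS:
<coordinator-app name="daily-data-coord" frequency="${coord:days(1)}"
                 start="2015-09-01T14:00Z" end="2016-09-01T14:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <datasets>
        <dataset name="feed" frequency="${coord:days(1)}"
                 initial-instance="2015-09-01T14:00Z" timezone="UTC">
            <!-- one directory per day; the _SUCCESS flag marks the feed as complete -->
            <uri-template>${nameNode}/data/feed/${YEAR}${MONTH}${DAY}</uri-template>
            <done-flag>_SUCCESS</done-flag>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="input" dataset="feed">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>${nameNode}/user/${user.name}/map-reduce</app-path>
            <configuration>
                <property>
                    <name>inputDir</name>
                    <value>${coord:dataIn('input')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
If the feed is missing, the coordinator action simply waits until it appears or times out; combined with the decision-node approach in the other answer, that covers your "skip the step" requirement.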
Upvotes: 1
Reputation: 13402
Basically, you want to run an Oozie workflow for a bunch of MR jobs based on data availability at a scheduled time of day. You need to define a decision node to check for the data's existence and a map-reduce action to run each MapReduce job. You can also add mail notification for job failures (a sample email action is sketched after the workflow below). You can find detailed information in the MapReduce Node, Decision Node, and Oozie Actions documentation. Below are a sample decision node and map-reduce node, along with a job.properties file. Here is the command to run the Oozie workflow; you can schedule it with cron to run daily at a given time.
oozie job -config job.properties -D param1=value -run
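For instance, a crontab entry along these lines (the Oozie server URL and the path to job.properties are illustrative) would launch it every day at 2 PM:
0 14 * * * oozie job -oozie http://localhost:11000/oozie -config /home/hadoop/job.properties -run
Here is the sample workflow: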
<workflow-app xmlns="uri:oozie:workflow:0.4" name="${app_name}">
    <global>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
    </global>
    <start to="data1_check"/>
    <decision name="data1_check">
        <switch>
            <!-- run data1_job only if its input exists on HDFS; otherwise fall through -->
            <case to="data1_job">${fs:exists(inputDir)}</case>
            <default to="data2_check"/>
        </switch>
    </decision>
    <action name="data1_job">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <!-- delete the output dir so a re-run does not fail; adjust the path to your layout -->
                <delete path="${nameNode}/user/${wf:user()}/${outputDir}"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>org.myorg.WordCount.Map</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>org.myorg.WordCount.Reduce</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="data2_check"/>
        <error to="data2_check"/>
    </action>
    <!-- Both <ok> and <error> above go to the data2_check decision node, because the
         next data job should run whether this one succeeds or fails. To stop the
         workflow on failure instead, point <error> to the kill node. -->
    <!-- The data2_check decision and data2_job action (not shown) repeat the same
         pattern; the last MR action should go to the 'kill' node on failure and the
         'end' node on success. -->
<kill name="fail">
<message>Errormessage[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end" />
</workflow-app>
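As for the mail notification mentioned above, here is a rough sketch of an Oozie email action (using the uri:oozie:email-action:0.1 schema; the node name and address are illustrative, and your Oozie server must have SMTP configured) that an MR action's <error> transition could point to before the kill node:
<action name="notify_failure">
    <email xmlns="uri:oozie:email-action:0.1">
        <to>admin@example.com</to>
        <subject>Workflow ${wf:id()} failed</subject>
        <body>Failed node: ${wf:lastErrorNode()}, error: ${wf:errorMessage(wf:lastErrorNode())}</body>
    </email>
    <!-- after mailing, continue to the kill node either way -->
    <ok to="fail"/>
    <error to="fail"/>
</action>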
Here is the job.properties file.
# nameNode: for a remote cluster use e.g. hdfs://abc.xyz.yahoo.com:8020
nameNode=hdfs://localhost:9000
# jobTracker: for a remote cluster use e.g. abc.xyz.yahoo.com:50300
jobTracker=localhost:9001
queueName=default
examplesRoot=map-reduce
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}
inputDir=input-data
outputDir=map-reduce
Upvotes: 1