Reputation: 153
I have a requirement to run multiple MapReduce jobs based on different sets of files that hit the same table. I was exploring Oozie, but I am not completely familiar with it.
My requirements are:
1. Run jobs based on a time trigger and/or file availability.
2. If certain files are not available, the corresponding step should be skipped.
3. The user should be able to configure which steps run and what priority each step has.
Can anyone tell me whether Oozie fits these requirements? If so, how can I accomplish this?
If not, is there any free or commercial tool similar to Visual Cron (which we intend to replace) that can run MapReduce and Java-based jobs?
Upvotes: 2
Views: 593
Reputation: 9067
Quoting "Oozie Coord Use Cases" (from guys who actually used Oozie before it was Open Source - and are still the biggest users by far)
Here are some typical use cases for the Oozie Coordinator Engine.
- You want to run your workflow once a day at 2PM (similar to a CRON).
- You want to run your workflow every hour and you also want to wait for specific data feeds to be available on HDFS.
- You want to run a workflow that depends on other workflows.
The page continues with a tutorial.
And, by the way, the latest release of Oozie is V4.2; see its Coordinator documentation.
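To make the second use case concrete, here is a minimal coordinator sketch (the name, dates, and dataset path are illustrative assumptions, using the uri:oozie:coordinator:0.4 schema) that runs a workflow daily at 2 PM only once that day's data feed has landed on HDFS:
<coordinator-app name="daily-data-coord" frequency="${coord:days(1)}"
                 start="2015-09-01T14:00Z" end="2016-09-01T14:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <datasets>
        <dataset name="feed" frequency="${coord:days(1)}"
                 initial-instance="2015-09-01T14:00Z" timezone="UTC">
            <!-- one directory per day; the _SUCCESS flag marks the feed as complete -->
            <uri-template>${nameNode}/data/feed/${YEAR}${MONTH}${DAY}</uri-template>
            <done-flag>_SUCCESS</done-flag>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="input" dataset="feed">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>${nameNode}/user/${user.name}/map-reduce</app-path>
            <configuration>
                <property>
                    <name>inputDir</name>
                    <value>${coord:dataIn('input')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
If the feed is missing, the coordinator action simply waits until it appears or times out; combined with the decision-node approach in the other answer, that covers your "skip the step" requirement.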
Upvotes: 1
Reputation: 13402
Basically, you want to run an Oozie workflow for a bunch of MR jobs based on data availability at a scheduled time of day. You need to define a decision node to check for the data's existence and a map-reduce action to run each MapReduce job. You can also add mail notification for job failures (a sample email action is sketched after the workflow below). You can find detailed information in the MapReduce Node, Decision Node, and Oozie Actions documentation. Below are a sample decision node and map-reduce node, along with a job.properties file. Here is the command to run the Oozie workflow; you can schedule it with cron to run daily at a given time.
oozie job -config job.properties -D param1=value -run
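For instance, a crontab entry along these lines (the Oozie server URL and the path to job.properties are illustrative) would launch it every day at 2 PM:
0 14 * * * oozie job -oozie http://localhost:11000/oozie -config /home/hadoop/job.properties -run
Here is the sample workflow: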
<workflow-app xmlns="uri:oozie:workflow:0.4" name="${app_name}">
    <global>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
    </global>
    <start to="data1_check"/>
    <decision name="data1_check">
        <switch>
            <!-- run data1_job only if its input exists on HDFS; otherwise fall through -->
            <case to="data1_job">${fs:exists(inputDir)}</case>
            <default to="data2_check"/>
        </switch>
    </decision>
    <action name="data1_job">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <!-- delete the output dir so a re-run does not fail; adjust the path to your layout -->
                <delete path="${nameNode}/user/${wf:user()}/${outputDir}"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>org.myorg.WordCount.Map</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>org.myorg.WordCount.Reduce</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="data2_check"/>
        <error to="data2_check"/>
    </action>
    <!-- Both <ok> and <error> above go to the data2_check decision node, because the
         next data job should run whether this one succeeds or fails. To stop the
         workflow on failure instead, point <error> to the kill node. -->
    <!-- The data2_check decision and data2_job action (not shown) repeat the same
         pattern; the last MR action should go to the 'kill' node on failure and the
         'end' node on success. -->
<kill name="fail">
<message>Errormessage[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end" />
</workflow-app>
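As for the mail notification mentioned above, here is a rough sketch of an Oozie email action (using the uri:oozie:email-action:0.1 schema; the node name and address are illustrative, and your Oozie server must have SMTP configured) that an MR action's <error> transition could point to before the kill node:
<action name="notify_failure">
    <email xmlns="uri:oozie:email-action:0.1">
        <to>admin@example.com</to>
        <subject>Workflow ${wf:id()} failed</subject>
        <body>Failed node: ${wf:lastErrorNode()}, error: ${wf:errorMessage(wf:lastErrorNode())}</body>
    </email>
    <!-- after mailing, continue to the kill node either way -->
    <ok to="fail"/>
    <error to="fail"/>
</action>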
Here is the job.properties file.
# nameNode: for a remote cluster use e.g. hdfs://abc.xyz.yahoo.com:8020
nameNode=hdfs://localhost:9000
# jobTracker: for a remote cluster use e.g. abc.xyz.yahoo.com:50300
jobTracker=localhost:9001
queueName=default
examplesRoot=map-reduce
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}
inputDir=input-data
outputDir=map-reduce
Upvotes: 1