Ozgen

Reputation: 1092

Process Rows Separately in Oozie

I have a simple input file with 2 columns like

pkg1 date1
pkg2 date2
pkg3 date3
...
...

I want to create an Oozie workflow which will process each row separately. For each row, I want to run multiple actions one after another (Hive, Pig, ...) and then move on to the next row. But it is more difficult than I expected. I think I have to create a loop somehow and iterate through the rows.

Can you give me architectural advice on how I can achieve this?

Upvotes: 0

Views: 62

Answers (2)

Samson Scharfrichter

Reputation: 9067

I totally agree with @Mattinbits: you must use some procedural code (shell script, Python, etc.) to run the loop and fire the appropriate Pig/Hive tasks.

But if your process must wait for the tasks to complete before launching the next batch, the coordination part might become a bit more complicated to implement. I can think of a very evil way to use Oozie for that coordination...

  • write a generic Oozie Workflow that runs the Pig/Hive actions for 1 set of parameters, passed as properties
  • write a "master template" Oozie Workflow that just runs the WF above as a sub-workflow, with dummy values for the properties
  • cut the template into 3 parts: XML header, sub-workflow call (with placeholders for the actual property values) and XML footer
  • your loop then builds the actual "master" Workflow dynamically, by concatenating the header, a call to the sub-workflow for the 1st set of values, another call for the 2nd set, etc., then the footer, and finally submits the Workflow to the Oozie server (using REST or the command line interface)

Of course there are some other things to take care of: generating unique names for the sub-workflow actions, chaining them, handling errors. The usual stuff.
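The steps above can be sketched roughly as follows in Python. This is only an illustration under assumptions: the property names `pkg` and `date`, the `${subWfPath}` variable, the `row-N` action names and the `fail` kill node are all made up for the example, and the schema version may need adjusting for your Oozie release.

```python
from xml.sax.saxutils import escape

# XML header: opens the master workflow and points at the first row's action.
HEADER = (
    '<workflow-app name="master" xmlns="uri:oozie:workflow:0.5">\n'
    '  <start to="row-0"/>\n'
)

# One sub-workflow call; {i}, {pkg}, {date}, {next} are filled in by the loop.
# ${{subWfPath}} renders as ${subWfPath}, a hypothetical job property holding
# the HDFS path of the generic per-row workflow.
CALL = '''  <action name="row-{i}">
    <sub-workflow>
      <app-path>${{subWfPath}}</app-path>
      <propagate-configuration/>
      <configuration>
        <property><name>pkg</name><value>{pkg}</value></property>
        <property><name>date</name><value>{date}</value></property>
      </configuration>
    </sub-workflow>
    <ok to="{next}"/>
    <error to="fail"/>
  </action>
'''

# XML footer: shared kill node for errors, then the end node.
FOOTER = '''  <kill name="fail">
    <message>Row failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
'''


def parse_rows(text):
    """Turn 'pkg date' lines into (pkg, date) tuples, skipping malformed lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            rows.append((parts[0], parts[1]))
    return rows


def build_master(rows):
    """Concatenate header, one chained sub-workflow call per row, and footer."""
    if not rows:
        raise ValueError("need at least one row")
    calls = []
    for i, (pkg, date) in enumerate(rows):
        # Chain each action to the next one; the last action goes to "end".
        nxt = f"row-{i + 1}" if i + 1 < len(rows) else "end"
        calls.append(CALL.format(i=i, pkg=escape(pkg), date=escape(date), next=nxt))
    return HEADER + "".join(calls) + FOOTER
```

The generated text would then be written to `workflow.xml`, uploaded to HDFS and submitted. Because each action's `ok` transition points at the next row, the rows run strictly one after another, and the first failing row stops the chain.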

Upvotes: 1

mattinbits

Reputation: 10428

Oozie does not support loops/cycles, since a workflow is defined as a Directed Acyclic Graph (DAG):

https://oozie.apache.org/docs/3.3.0/WorkflowFunctionalSpec.html#a2.1_Cycles_in_Workflow_Definitions

Also, there is no built-in way (that I'm aware of) to read data from Hive into an Oozie workflow and use it to control the flow of that workflow.

You could have a single Oozie workflow which launches some custom process (e.g. a Shell Action), and within that process read the data from Hive, and launch a new, separate, Oozie workflow for each entry.
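A minimal sketch of such a launcher process in Python, assuming the rows have already been read into (pkg, date) pairs. The server URL, the `job.properties` file and the property names `pkg` and `date` are placeholders, and the parsing of the CLI output ("job: <id>" and a "Status : ..." line) is based on what the `oozie` client typically prints, so check it against your version. Waiting for each job before launching the next is an addition here, to match the one-row-at-a-time requirement in the question.

```python
import subprocess
import time

OOZIE_URL = "http://localhost:11000/oozie"  # assumption: your Oozie server URL
PROPS = "job.properties"                    # assumption: shared job properties


def submit_cmd(pkg, date):
    """CLI command that starts the per-row workflow with this row's values."""
    return ["oozie", "job", "-oozie", OOZIE_URL, "-config", PROPS,
            "-D", f"pkg={pkg}", "-D", f"date={date}", "-run"]


def info_cmd(job_id):
    """CLI command that prints the current state of a submitted job."""
    return ["oozie", "job", "-oozie", OOZIE_URL, "-info", job_id]


def parse_job_id(output):
    """Extract the id from a line such as 'job: <id>'."""
    for line in output.splitlines():
        if line.startswith("job:"):
            return line.split(":", 1)[1].strip()
    raise ValueError("no job id in CLI output")


def parse_status(output):
    """Extract the value of the first 'Status : ...' line, or '' if absent."""
    for line in output.splitlines():
        if line.strip().startswith("Status"):
            return line.split(":", 1)[1].strip()
    return ""


def run_cli(cmd):
    """Run a command and return its stdout; raises if the command fails."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def process_rows(rows, run=run_cli, poll_seconds=30):
    """Launch one workflow per row and wait for it before starting the next."""
    for pkg, date in rows:
        job_id = parse_job_id(run(submit_cmd(pkg, date)))
        while True:
            status = parse_status(run(info_cmd(job_id)))
            if status == "SUCCEEDED":
                break
            if status in ("KILLED", "FAILED"):
                raise RuntimeError(f"{job_id} ended with status {status}")
            time.sleep(poll_seconds)
```

The `run` parameter is only there so the loop can be exercised without a cluster; in a real Shell Action you would call `process_rows(rows)` and let it shell out to the `oozie` client.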

Upvotes: 1
