Reputation: 369
I'm trying to develop a set of Knime nodes that integrate with Apache Oozie. Specifically, I want to build, launch and monitor Oozie workflows from within Knime.
I've had some success implementing this for linear Oozie workflows, but I'm stumped on how to support branching.
As background, let me explain how I did this for linear workflows:
Essentially, my solution expresses each Oozie Action as a Knime Node. Each of these nodes has 2 modes of operation, and the appropriate one is chosen based on the content of certain flow variables. These 2 modes are needed because I have to execute the Oozie portion (OozieStartAction to OozieStopAction) twice: the first iteration generates the Oozie workflow, and the second launches and monitors it. Flow variables persist between iterations of this loop.
In one mode of operation, a node appends the xml content particular to the Oozie action it represents to the overall Oozie workflow xml and then forwards it.
In the other, the node simply polls Oozie for the status of the action it represents.
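To make the polling mode concrete, here is a minimal sketch in Python (the real nodes would be Java). It assumes a hypothetical `fetch_job_info` callable that returns job info as a dict with an `actions` list of `name`/`status` entries, roughly the shape of Oozie's job info JSON. The set of terminal statuses is also an assumption.

```python
import time
from typing import Any, Callable, Dict, Optional

# Assumption: statuses treated as "the action has finished running".
TERMINAL_STATUSES = {"OK", "ERROR", "KILLED", "FAILED"}


def wait_for_action(
    fetch_job_info: Callable[[str], Dict[str, Any]],  # hypothetical: job id -> job info dict
    job_id: str,
    action_name: str,
    interval_s: float = 5.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the named action reaches a terminal status and return that status."""
    polls = 0
    while True:
        info = fetch_job_info(job_id)
        # The action may not be listed yet if Oozie has not reached it.
        status = next(
            (a.get("status") for a in info.get("actions", []) if a.get("name") == action_name),
            None,
        )
        if status in TERMINAL_STATUSES:
            return status
        polls += 1
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"action {action_name!r} did not finish after {polls} polls")
        sleep(interval_s)
```

Passing the fetcher and the sleep function in as parameters keeps the loop testable without a running Oozie server.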
The following flow vars are used in this workflow:
-OOZIE_XML: contains oozie workflow xml
-OOZIE_JOB_ID: id of the running oozie job launched with the assembled workflow
-PREV_ACTION_NAME: Name of the previous action
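As a rough illustration of how the mode could be picked from these flow variables, here is a small Python sketch. The function names and mode labels are made up for illustration, and the flow variables are modelled as a plain dict.

```python
from typing import Dict

# Hypothetical mode labels, not Knime or Oozie API names.
GENERATE = "generate"
LAUNCH = "launch"
MONITOR = "monitor"


def start_node_mode(flow_vars: Dict[str, str]) -> str:
    """Start node: create the workflow XML if none exists yet, otherwise launch it."""
    return LAUNCH if flow_vars.get("OOZIE_XML") else GENERATE


def action_node_mode(flow_vars: Dict[str, str]) -> str:
    """Action node: append XML while no job is running, otherwise poll the job."""
    return MONITOR if flow_vars.get("OOZIE_JOB_ID") else GENERATE
```

A missing variable and a blank one are treated the same way here, which matches the "blank or no OOZIE_XML" check.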
In the example above, what would happen step by step is the following:
-OozieStartNode runs, sees it has a blank or no OOZIE_XML variable, so it creates one itself, setting the basic workflow-app and start xml nodes. It also creates a PREV_ACTION_NAME flow var with value "start".
-The first OozieGenericAction sees that it has a blank OOZIE_JOB_ID so it appends a new action to the workflow-app node in the received OOZIE_XML, gets the node with the "name" attribute equal to PREV_ACTION_NAME and sets its transition to the action it just created. PREV_ACTION_NAME is then overwritten with the current action's name.
...
-The StopOozieAction simply creates an end node and sets the previous action's transition to it, much like the previous generic action.
-In the second iteration, OozieStart sees it has XML data, so the secondary execution mode is called. This uploads the workflow XML into HDFS, creates a new Oozie job with this workflow, and forwards the received job id as OOZIE_JOB_ID.
-The following Oozie Actions, having a valid OOZIE_JOB_ID, simply poll Oozie for the status of their own action name, ending execution once their respective actions finish running.
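The generation pass described above can be sketched roughly as follows, in Python with `xml.etree.ElementTree`. This is a simplified illustration, not the actual node code: the function names are hypothetical, the Oozie XML namespace and the action body are left out, and each action is assumed to carry its transition in an `ok` child element. The `start` element is special-cased because it has a `to` attribute rather than a `name`.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def _link_previous(root: ET.Element, prev_name: str, target: str) -> None:
    """Point the previous node's transition at `target`."""
    if prev_name == "start":
        root.find("start").set("to", target)
    else:
        prev = root.find(f"action[@name='{prev_name}']")
        prev.find("ok").set("to", target)


def start_workflow(app_name: str) -> Dict[str, str]:
    """Start node, first pass: create the workflow-app skeleton and the flow variables."""
    root = ET.Element("workflow-app", {"name": app_name})
    ET.SubElement(root, "start")
    return {"OOZIE_XML": ET.tostring(root, encoding="unicode"), "PREV_ACTION_NAME": "start"}


def append_action(flow_vars: Dict[str, str], name: str) -> Dict[str, str]:
    """Generic action, first pass: add the action and link the previous node to it."""
    root = ET.fromstring(flow_vars["OOZIE_XML"])
    action = ET.SubElement(root, "action", {"name": name})
    ET.SubElement(action, "ok")  # transition target is filled in by the next node
    _link_previous(root, flow_vars["PREV_ACTION_NAME"], name)
    return {**flow_vars,
            "OOZIE_XML": ET.tostring(root, encoding="unicode"),
            "PREV_ACTION_NAME": name}


def end_workflow(flow_vars: Dict[str, str], end_name: str = "end") -> Dict[str, str]:
    """Stop node, first pass: add the end node and link the previous action to it."""
    root = ET.fromstring(flow_vars["OOZIE_XML"])
    ET.SubElement(root, "end", {"name": end_name})
    _link_previous(root, flow_vars["PREV_ACTION_NAME"], end_name)
    return {**flow_vars,
            "OOZIE_XML": ET.tostring(root, encoding="unicode"),
            "PREV_ACTION_NAME": end_name}
```

Chaining `start_workflow`, `append_action` and `end_workflow` mirrors the node order in a linear workflow, with the dict standing in for the flow variables passed from node to node.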
The main problem I'm facing is in the workflow XML assembly: I can't rely on a single PREV_ACTION_NAME variable once branching is involved. If I had a join action with many nodes linking to it, one previous node name would overwrite the others and the node relation data would be lost.
Does anybody have any broad ideas about which direction I could take this in?
Upvotes: 1
Views: 483
Reputation: 321
How about using a variable to column, so that there's a column in the recursive loop called "Previous Action Name"? It might seem like overkill to keep the same value in that column for every row, but the recursive loop would pass it along just like any other column.
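One possible way to take this further, sketched in Python: if the column is allowed to hold one row per predecessor instead of the same value everywhere, a join can see every incoming branch. This is an extension of the idea above, not something Knime does for you. The `append_join` helper is hypothetical, rows are modelled as plain dicts, the Oozie XML namespace is omitted, and each action is assumed to keep its transition in an `ok` child element.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

PREV_COLUMN = "Previous Action Name"  # column name taken from the suggestion above


def append_join(
    workflow_xml: str, prev_rows: List[Dict[str, str]], join_name: str
) -> Tuple[str, List[Dict[str, str]]]:
    """Add a join node and point every predecessor listed in `prev_rows` at it."""
    root = ET.fromstring(workflow_xml)
    ET.SubElement(root, "join", {"name": join_name})
    for row in prev_rows:
        prev = root.find(f"action[@name='{row[PREV_COLUMN]}']")
        prev.find("ok").set("to", join_name)
    # After the join there is a single predecessor again: the join itself.
    return ET.tostring(root, encoding="unicode"), [{PREV_COLUMN: join_name}]
```

With two branch actions `a` and `b`, passing rows for both gives each of them a transition to the join, so neither overwrites the other.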
BTW, have you seen these? https://www.knime.org/knime-big-data-connectors
Upvotes: 0