Reputation: 1706
I have PySpark code that runs on an edge node of a Hadoop cluster. The code performs various steps, from feature engineering to ML training and prediction. It lives on GitHub, and I can pull it onto the edge node. It can be submitted with spark-submit in yarn/client or yarn/cluster mode. So far so good.
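For reference, the submit command looks roughly like this (file names are illustrative):

```bash
# Submit the PySpark job to YARN in cluster mode;
# dependencies.zip and main.py are placeholder names
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --py-files dependencies.zip \
  main.py
```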
Now I would like to schedule some of these tasks regularly.
My question is: how do I deploy the code on the Hadoop cluster in a clean/easy way every time I make a modification, so that it can be scheduled with Oozie? (I guess Oozie is the best option for scheduling, since it is already installed.)
What is the best practice for such a task? (This is Data Science/ML code, so we have our own Hadoop cluster, separate from the Production cluster used for data ingestion and data processing with Scala...)
Upvotes: 0
Views: 1319
Reputation: 191973
Oozie needs to run the JAR from HDFS.
You can follow an scp, or a git pull plus a package command, with an hdfs put.
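A minimal sketch of that flow, run on the edge node (repo location, artifact name, and HDFS paths are all hypothetical):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Pull the latest code onto the edge node (repo path is hypothetical)
cd /home/deploy/ml-pipeline
git pull origin master

# Package the job; for pure PySpark this might just be zipping the modules
zip -r ml_pipeline.zip ml_pipeline/

# Overwrite the artifact under the Oozie application path on HDFS
hdfs dfs -put -f ml_pipeline.zip /user/oozie/apps/ml-pipeline/lib/
```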
If you're using Maven, you can try the Maven Wagon SSH or Maven Exec plugins and bind them to the deploy phase of the Maven lifecycle. Then mvn deploy will run the necessary commands to put the code on the edge node and HDFS. This is essentially the same task your CD engine would need to do, although that CI/CD server would need a Hadoop client configured against your cluster in order to run hdfs commands.
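The shell equivalent of what such a CI/CD step (or Maven binding) would execute might look like this; hostnames, artifact names, and paths are placeholders, and the host running hdfs must have a client configured for the target cluster:

```bash
# Build on the CI server, then copy the artifact to the edge node
mvn package
scp target/ml-pipeline-1.0.jar deploy@edge-node:/home/deploy/artifacts/

# Push the same artifact to HDFS; assumes a Hadoop client on this host
# is configured against the target cluster
hdfs dfs -put -f target/ml-pipeline-1.0.jar /user/oozie/apps/ml-pipeline/lib/
```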
If you set up an Oozie coordinator and simply replace your JAR or Oozie job property files on HDFS, there's no need to edit any other Oozie settings during your deployment cycle.
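For instance, once a coordinator definition is on HDFS and submitted as below, a redeploy is just the hdfs put shown earlier (the Oozie URL and property values here are hypothetical):

```bash
# Minimal job.properties for a coordinator (values are hypothetical)
cat > job.properties <<'EOF'
nameNode=hdfs://namenode:8020
jobTracker=resourcemanager:8032
oozie.coord.application.path=${nameNode}/user/oozie/apps/ml-pipeline
EOF

# Submit and start the coordinator; later JAR swaps on HDFS need no resubmit
oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run
```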
Plus, Oozie has a REST API if you want to restart or kill running jobs programmatically.
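For example, killing a running job through the Oozie REST API looks roughly like this (host and job ID are placeholders):

```bash
# Kill a running job via the Oozie v1 REST API;
# other supported actions include start, suspend, resume, and rerun
curl -X PUT "http://oozie-host:11000/oozie/v1/job/0000001-200101000000000-oozie-oozi-W?action=kill"
```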
Upvotes: 1