Reputation: 1706
I have PySpark code that runs on an edge node of a Hadoop cluster. The code performs various steps, from feature engineering to ML training and prediction. It lives on GitHub, and I can pull it onto the edge node. It can be submitted with spark-submit in yarn/client or yarn/cluster mode. So far so good.
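For reference, the submit command looks roughly like this (file names are illustrative):

```bash
# Submit the PySpark job to YARN in cluster mode;
# dependencies.zip and main.py are placeholder names
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --py-files dependencies.zip \
  main.py
```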
Now I would like to schedule some of these tasks regularly.
My question is: how do I deploy the code on the Hadoop cluster in a clean/easy way every time I make a modification, so that it can be scheduled with Oozie? (I guess Oozie is the best option for scheduling, since it is already installed.)
What is the best practice for such a task? (This is Data Science/ML code, so we have our own Hadoop cluster, separate from the Production cluster used for data ingestion and data processing with Scala...)
Upvotes: 0
Views: 1319
Reputation: 191973
Oozie needs to run the JAR from HDFS.
You can follow an scp, or a git pull plus a package command, with an hdfs put.
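A minimal sketch of that flow, run on the edge node (repo location, artifact name, and HDFS paths are all hypothetical):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Pull the latest code onto the edge node (repo path is hypothetical)
cd /home/deploy/ml-pipeline
git pull origin master

# Package the job; for pure PySpark this might just be zipping the modules
zip -r ml_pipeline.zip ml_pipeline/

# Overwrite the artifact under the Oozie application path on HDFS
hdfs dfs -put -f ml_pipeline.zip /user/oozie/apps/ml-pipeline/lib/
```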
If you're using Maven, you can try the Maven Wagon SSH or Maven Exec plugins and bind them to the deploy phase of the Maven lifecycle. Then mvn deploy will run the necessary commands to put the code on the edge node and HDFS. This is essentially the same task your CD engine would need to do, although that CI/CD server would need a Hadoop client configured against your cluster in order to run hdfs commands.
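The shell equivalent of what such a CI/CD step (or Maven binding) would execute might look like this; hostnames, artifact names, and paths are placeholders, and the host running hdfs must have a client configured for the target cluster:

```bash
# Build on the CI server, then copy the artifact to the edge node
mvn package
scp target/ml-pipeline-1.0.jar deploy@edge-node:/home/deploy/artifacts/

# Push the same artifact to HDFS; assumes a Hadoop client on this host
# is configured against the target cluster
hdfs dfs -put -f target/ml-pipeline-1.0.jar /user/oozie/apps/ml-pipeline/lib/
```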
If you set up an Oozie coordinator and simply replace your JAR or Oozie job property files on HDFS, there's no need to edit any other Oozie settings during your deployment cycle.
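For instance, once a coordinator definition is on HDFS and submitted as below, a redeploy is just the hdfs put shown earlier (the Oozie URL and property values here are hypothetical):

```bash
# Minimal job.properties for a coordinator (values are hypothetical)
cat > job.properties <<'EOF'
nameNode=hdfs://namenode:8020
jobTracker=resourcemanager:8032
oozie.coord.application.path=${nameNode}/user/oozie/apps/ml-pipeline
EOF

# Submit and start the coordinator; later JAR swaps on HDFS need no resubmit
oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run
```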
Plus, Oozie has a REST API if you want to restart or kill running jobs programmatically.
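For example, killing a running job through the Oozie REST API looks roughly like this (host and job ID are placeholders):

```bash
# Kill a running job via the Oozie v1 REST API;
# other supported actions include start, suspend, resume, and rerun
curl -X PUT "http://oozie-host:11000/oozie/v1/job/0000001-200101000000000-oozie-oozi-W?action=kill"
```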
Upvotes: 1