Matt

Reputation: 3353

Running scheduled Spark job

I have a Spark job which reads a source table, does a number of map / flatten / reduce operations and then stores the results into a separate table we use for reporting. Currently this job is run manually using the spark-submit script. I want to schedule it to run every night so the results are pre-populated for the start of the day. Do I:

  1. Set up a cron job to call the spark-submit script?
  2. Add scheduling into my job class, so that it is submitted once but performs the actions every night?
  3. Is there a built-in mechanism in Spark or a separate script that will help me do this?

We are running Spark in Standalone mode.

Any suggestions appreciated!

Upvotes: 39

Views: 36679

Answers (6)

Shiva Garg

Reputation: 916

Recommended schedulers:

  • Airbnb Airflow (now Apache Airflow)
  • Apache Oozie
  • Apache NiFi
  • Cron job (least recommended)

Upvotes: 2

akshat thakar

Reputation: 1527

You can use Rundeck to schedule jobs; it has a decent UI for managing job failures and notifications.

Upvotes: 1

Krishna Kalyan

Reputation: 1702

The most standard scheduler, which comes with all Apache Hadoop distributions, is Oozie.

https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html

In my experience, the XML is a little hard to work with initially, but once you get the hang of it, it works like a charm.
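
To give a feel for the moving parts: once the workflow XML with the Spark action is written, you submit and monitor it through the Oozie CLI. A minimal sketch, assuming a job.properties that points at your workflow.xml; the host is a placeholder and 11000 is Oozie's default port:

# Submit and start the workflow; prints the workflow job ID.
oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run

# Check the status using the job ID printed by -run.
oozie job -oozie http://oozie-host:11000/oozie -info <job-id>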

Upvotes: 3

Ofer Eliassaf

Reputation: 2958

Crontab is good enough only if you don't care about high availability, since it will run on a single machine that can fail.

The fact that you run in standalone mode indicates that you don't have Hadoop or Mesos installed, which come with tools that make this task more reliable.

An alternative to crontab (though at the moment it suffers from high-availability issues as well) is Airbnb's Airflow. It was built for exactly such use cases (among others); see here: http://airflow.incubator.apache.org/scheduler.html.

Mesos users can try Chronos, which is a cron-like scheduler for clusters: https://github.com/mesos/chronos.
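
As a rough illustration, Chronos jobs are registered over its REST API; in this sketch the host, port, job name, command, and owner are all placeholders (the port in particular varies by install), and the schedule uses Chronos's ISO 8601 repeating-interval format:

# Register a job that repeats daily (P1D) starting at 01:30 UTC.
curl -X POST -H "Content-Type: application/json" \
  http://chronos-host:4400/scheduler/iso8601 \
  -d '{
        "name": "nightly-report",
        "command": "/path/to/spark-submit-wrapper.sh",
        "schedule": "R/2016-01-01T01:30:00Z/P1D",
        "owner": "you@example.com"
      }'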

There is also Oozie, which comes from the Hadoop world: http://blog.cloudera.com/blog/2013/01/how-to-schedule-recurring-hadoop-jobs-with-apache-oozie/.

If this is mission critical, you can even program it yourself if you use Consul/ZooKeeper or other tools that provide leader election: have your processes run on multiple machines, have them compete for leadership, and make sure the leader submits the job to Spark.
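
One low-effort way to approximate this with Consul, as a sketch rather than a hardened setup (the lock prefix, class, master URL, and jar name are placeholders): run the same cron entry on several machines, but wrap the submission in consul lock so that only one machine actually executes it at a time.

# Only the process holding the lock runs the child command.
consul lock spark/nightly-report \
  "$SPARK_HOME/bin/spark-submit --class com.example.ReportJob --master spark://master:7077 report-assembly.jar"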

You can use Spark Job Server to make the job submission more elegant: https://github.com/spark-jobserver/spark-jobserver
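
In rough terms, going by the project's README, you upload your assembly jar once and then trigger runs over HTTP; in this sketch the host, app name, and job class are placeholders, and 8090 is the server's default port:

# Upload the assembly jar under an app name (one-time step).
curl --data-binary @spark-jobs-assembly.jar http://jobserver-host:8090/jars/reporting

# Trigger a run of a specific job class; returns a job ID you can poll.
curl -d "" "http://jobserver-host:8090/jobs?appName=reporting&classPath=com.example.ReportJob"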

Upvotes: 5

ben jarman

Reputation: 1138

You can use a crontab, but really, as you start having Spark jobs that depend on other Spark jobs, I would recommend Pinball for coordination: https://github.com/pinterest/pinball

To get a simple crontab working, I would create a wrapper script such as:

#!/bin/bash
# Wrapper around spark-submit so a job can be launched from cron.
# Usage: <script> <main-class> <master> [spark-submit-args] [class-args]
cd /locm/spark_jobs || exit 1

export SPARK_HOME=/usr/hdp/2.2.0.0-2041/spark
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_USER_NAME=hdfs
export HADOOP_GROUP=hdfs

#export SPARK_CLASSPATH=$SPARK_CLASSPATH:/locm/spark_jobs/configs/*

CLASS=$1       # fully qualified main class of the Spark job
MASTER=$2      # Spark master URL (e.g. spark://host:7077 or yarn-client)
ARGS=$3        # extra spark-submit options; left unquoted below so they word-split
CLASS_ARGS=$4  # arguments passed through to the job's main method
echo "Running $CLASS With Master: $MASTER With Args: $ARGS And Class Args: $CLASS_ARGS"

# Append stdout and stderr to a per-class log file.
"$SPARK_HOME"/bin/spark-submit --class "$CLASS" --master "$MASTER" --num-executors 4 --executor-cores 4 $ARGS spark-jobs-assembly*.jar $CLASS_ARGS >> "/locm/spark_jobs/logs/$CLASS.log" 2>&1

Then create a crontab entry:

  1. Run crontab -e
  2. Insert 30 1 * * * /PATH/TO/SCRIPT.sh $CLASS "yarn-client" (the five cron fields mean minute 30, hour 1, every day of every month, i.e. 1:30 AM each night)

Upvotes: 13

Shayan Masood

Reputation: 1057

There is no built-in mechanism in Spark that will help. A cron job seems reasonable for your case. If you find yourself continuously adding dependencies to the scheduled job, try Azkaban.

Upvotes: 11
