Reputation: 900
My sensor data is captured in Hive tables, and I want to run Spark jobs on them at regular intervals, say as 15-minute, 30-minute, and 45-minute jobs.
We are using a cron scheduler to run the jobs (different spark-submits) at fixed intervals. The problem is that, due to YARN resource contention, jobs run slowly, and cron keeps triggering the same jobs again and again.
For example: a 30-minute job was triggered and then delayed by cluster resource issues, yet cron kept submitting another 30-minute job every 30 minutes.
Maybe one way to address this is using Quartz/Oozie scheduler actions.
Is there a programmatic approach to ensure that a job with a given name must complete before the next job with the same name is triggered?
What is the best way to schedule them?
Upvotes: 1
Views: 718
Reputation: 29195
Option 1: You can use Airflow as the scheduler and create dependencies between jobs.
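For instance, here is a minimal DAG sketch (Airflow 1.x-style imports; the aggregate class, jar paths, and the 30-minute schedule are placeholders, not values from the question). Setting max_active_runs=1 together with catchup=False means a new run never starts while the previous one is still going, which is exactly the overlap that plain cron cannot prevent:

# Minimal Airflow DAG sketch; jar paths and the aggregate class are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="sensor_spark_jobs_30min",
    start_date=datetime(2019, 1, 1),
    schedule_interval="*/30 * * * *",   # every 30 minutes
    catchup=False,
    max_active_runs=1,  # never start a run while the previous one is active
)

ingest = BashOperator(
    task_id="ingest_sensor_data",
    bash_command=(
        "spark-submit --master yarn --class com.javachain.javachainfeed "
        "/path/to/javachain_family-assembly-5.0.jar"
    ),
    dag=dag,
)

aggregate = BashOperator(
    task_id="aggregate_sensor_data",
    bash_command="spark-submit --master yarn --class com.example.Aggregate /path/to/agg.jar",
    dag=dag,
)

ingest >> aggregate  # the aggregate job runs only after ingest succeeds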
Option 2: Run the Apache Spark job from crontab on Unix through a wrapper script that takes a PID-file lock, which prevents duplicate job submissions:
#!/bin/bash
LOCKFILE=/tmp/filelock.pid
SPARK_PROGRAM_CLASS=com.javachain.javachainfeed
SPARK_PROGRAM_JAR=javachain_family-assembly-5.0.jar
HIVE_TABLE=javachain_prd_tbls.Family_data

echolog() { echo "$(date '+%Y-%m-%d %H:%M:%S') $*"; }

# Process locking: if the lock file points at a live process, bail out
if [ -f "${LOCKFILE}" ]; then
    PID=$(cat "${LOCKFILE}")
    if ps -f -p "${PID}" > /dev/null; then
        echolog "Already running as pid ${PID}"
        exit 0
    fi
    [ -z "$DEBUGME" ] || echolog "${LOCKFILE} exists but contains PID ${PID} of a prior process"
else
    [ -z "$DEBUGME" ] || echolog "${LOCKFILE} does not exist, will create one"
fi
echo $$ > "${LOCKFILE}"

# Remove the lock file when the script exits, even on failure
trap 'rm -f "${LOCKFILE}"' EXIT

# Submit one Spark job per feed name (first field of each line)
while read -r FEEDNAME _; do
    spark-submit --master yarn-client --driver-memory 10G \
        --executor-memory 8G --num-executors 30 \
        --class "${SPARK_PROGRAM_CLASS}" "${SPARK_PROGRAM_JAR}" \
        --hiveTable "${HIVE_TABLE}" --className "${FEEDNAME}"
done < "familynames.txt"
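The wrapper itself can then be scheduled from crontab, e.g. */30 * * * * /path/to/run_feed_jobs.sh (the script path here is a placeholder). If a run fires while the previous one is still active, it exits immediately instead of submitting a duplicate. As an alternative to the hand-rolled PID file, flock(1) gives the same guarantee in one line, e.g. flock -n /tmp/filelock.lock /path/to/run_feed_jobs.sh, and releases the lock automatically when the process dies, so stale lock files are never an issue.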
Also, to give Spark jobs fair access to cluster resources, I would suggest configuring the YARN Fair Scheduler (not FIFO).
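As a rough sketch (queue names and weights below are examples, not required values), the Fair Scheduler is enabled in yarn-site.xml and queues are described in a separate allocation file:

<!-- yarn-site.xml: switch the ResourceManager to the Fair Scheduler -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>

<!-- fair-scheduler.xml: example allocation file with equally weighted queues -->
<allocations>
  <queue name="jobs15min">
    <weight>1.0</weight>
  </queue>
  <queue name="jobs30min">
    <weight>1.0</weight>
  </queue>
</allocations>

With fair scheduling, one long-running job can no longer hold all the containers and starve the shorter-interval jobs.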
Upvotes: 0