Reputation: 1263
Background
I am looking to execute a bunch of Hive queries (roughly 20-30 queries, and growing in number). Some of these queries depend on the results of a few others, whereas some of them can be executed in parallel — together they form a DAG.
Question
Is there a workflow manager that can take care of building a DAG (given the bunch of queries as input) and executing these queries in parallel/sequentially (in the most optimal manner)?
What are the best practices for this?
Upvotes: 0
Views: 601
Reputation: 38325
This can also be easily implemented in a shell script. You can start parallel processes, wait for them, then start the next set of processes. An ampersand at the end of a command instructs the shell to run it as a background process. See this example:
#!/bin/bash
#Without pipefail, `wait` would return tee's exit status, not hive's
set -o pipefail

LOG_DIR=/tmp/my_log_dir

#Reset fail counter before starting parallel processes
FAIL=0

echo "Parallel loading 1, 2 and 3..."
hive -hiveconf "some_var"="$some_value" -f myscript_1.hql 2>&1 | tee "$LOG_DIR/myscript_1.log" &
hive -hiveconf "some_var"="$some_value" -f myscript_2.hql 2>&1 | tee "$LOG_DIR/myscript_2.log" &
hive -hiveconf "some_var"="$some_value" -f myscript_3.hql 2>&1 | tee "$LOG_DIR/myscript_3.log" &

#Wait for all three processes to finish
for job in $(jobs -p)
do
    echo "$job"
    wait "$job" || let "FAIL+=1"
done

#Exit if any process has failed
if [ "$FAIL" != "0" ]
then
    echo "Failed processes=($FAIL) Giving up..."
    exit 1
fi

#Reset fail counter before the next parallel step
FAIL=0

echo "Continue with next parallel steps 4,5..."
hive -hiveconf "some_var"="$some_value" -f myscript_4.hql 2>&1 | tee "$LOG_DIR/myscript_4.log" &
#and so on
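As an aside, if you write down the dependency pairs between scripts, a valid sequential execution order for the whole DAG can be derived with the standard `tsort` utility. A minimal sketch with hypothetical dependency pairs (each line reads "A must finish before B"):

```shell
#!/bin/bash
# Each input line declares "A B": script A must run before script B.
# tsort prints all script names in a valid topological (execution) order.
printf '%s\n' \
    "myscript_1.hql myscript_4.hql" \
    "myscript_2.hql myscript_4.hql" \
    "myscript_3.hql myscript_5.hql" \
    | tsort
```

Scripts that share no ordering constraint in the output can still be launched together with the background-process pattern.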
There are also other ways to run background processes: https://www.codeword.xyz/2015/09/02/three-ways-to-script-processes-in-parallel/
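One of those alternatives is `xargs -P`, which caps how many jobs run concurrently and exits non-zero if any job fails. A minimal sketch — the `echo` stands in for the real `hive -f` invocation, and the script names are hypothetical:

```shell
#!/bin/bash
# Run each input line as one job, at most 3 at a time (-P 3).
# In the real pipeline, replace the echo with something like:
#   hive -hiveconf "some_var"="$some_value" -f "$0" 2>&1 | tee "$LOG_DIR/$0.log"
printf '%s\n' myscript_1.hql myscript_2.hql myscript_3.hql \
    | xargs -n 1 -P 3 sh -c 'echo "running $0"'
```

Because `xargs` returns a non-zero status (123) when any invocation fails, each step can be gated on `$?` instead of a hand-rolled fail counter.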
Upvotes: 1
Reputation: 533
You can use any tool for workflow management; the best practice depends on your use case and expertise.
Traditionally, in corporate environments: Control-M or a cron scheduler can be used.
From the big data ecosystem: Oozie or Azkaban.
There are several other tools out there that can be used for workflow management.
Upvotes: 1