Reputation: 1014
I want to use Airflow to orchestrate jobs that include running some Pig scripts, shell scripts, and Spark jobs.
For the Spark jobs in particular, I want to use Apache Livy, but I am not sure whether that is a good idea or whether I should just run spark-submit.
What is the best way to track a Spark job with Airflow after I have submitted it?
Upvotes: 4
Views: 4707
Reputation: 11627
My assumption is that you have an application JAR containing Java / Scala code that you want to submit to a remote Spark cluster. Livy is arguably the best option for remote spark-submit when evaluated against the other possibilities (a minimal submission sketch follows this list):

- Specifying the remote master IP in spark-submit: requires modifying global configurations / environment variables
- SSHOperator: the SSH connection might break
- EmrAddStepsOperator: dependent on EMR
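As a minimal sketch of the Livy route, here is how a batch submission via Livy's REST API (POST /batches) could look. The Livy host, JAR path, and main class below are hypothetical placeholders, not values from the question:

```python
# Sketch: submit an application JAR to a remote Spark cluster via Livy's
# batch API. Host, JAR location, and class name are assumptions.
import json

import requests

LIVY_URL = "http://livy-host:8998"  # assumed Livy server address

payload = {
    "file": "hdfs:///jobs/my-spark-app.jar",  # hypothetical JAR location
    "className": "com.example.MySparkApp",    # hypothetical main class
    "args": ["arg1", "arg2"],
    "conf": {"spark.executor.memory": "2g"},
}

resp = requests.post(
    f"{LIVY_URL}/batches",
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()
batch_id = resp.json()["id"]  # Livy returns a batch id you can poll later
print(f"Submitted batch {batch_id}")
```

Nothing cluster-side needs to change for this: the machine running Airflow only needs HTTP access to the Livy server, which is exactly why it compares well against the master-IP and SSH options above.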
Regarding tracking

- Livy only reports state, not progress (% completion of stages)
- If that is acceptable, you can poll the Livy server via its REST API and keep printing the logs to the console; those will appear in the task logs in the WebUI (View Logs). A polling sketch follows this list
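Here is a sketch of such a polling loop, assuming the same hypothetical Livy address as above. It uses Livy's GET /batches/{id}/state and GET /batches/{id}/log endpoints; everything printed shows up in the Airflow task log:

```python
# Sketch: poll Livy for a batch's coarse state and echo its driver log.
# Run inside an Airflow task so the output lands in the WebUI task logs.
import time

import requests

LIVY_URL = "http://livy-host:8998"  # assumed Livy server address


def track_livy_batch(batch_id: int, poll_interval: int = 30) -> None:
    while True:
        state = requests.get(
            f"{LIVY_URL}/batches/{batch_id}/state"
        ).json()["state"]
        # GET /batches/{id}/log returns a window of driver log lines
        log_lines = requests.get(
            f"{LIVY_URL}/batches/{batch_id}/log",
            params={"from": 0, "size": 100},
        ).json().get("log", [])
        print(f"Batch {batch_id} state: {state}")
        for line in log_lines:
            print(line)
        if state in ("success", "dead", "killed"):
            if state != "success":
                raise RuntimeError(f"Batch {batch_id} ended in state {state}")
            return
        time.sleep(poll_interval)
```

You could call this from a PythonOperator right after the submit step; raising on a "dead" or "killed" state makes the Airflow task fail, so the DAG reflects the Spark job's outcome.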
Other considerations

- Livy doesn't support reusing a SparkSession across POST/batches requests
- If that is imperative, you would have to write your application code in PySpark and use POST/sessions requests instead (see the session sketch below)
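For completeness, a sketch of that session-based flow: create one interactive PySpark session with POST /sessions, then run statements against the same SparkSession via POST /sessions/{id}/statements. The host and the statement code are illustrative:

```python
# Sketch: interactive Livy session reusing one SparkSession across
# multiple statements (not possible with the batch API).
import json
import time

import requests

LIVY_URL = "http://livy-host:8998"  # assumed Livy server address
HEADERS = {"Content-Type": "application/json"}

# POST /sessions creates an interactive PySpark session
session = requests.post(
    f"{LIVY_URL}/sessions",
    data=json.dumps({"kind": "pyspark"}),
    headers=HEADERS,
).json()
session_id = session["id"]

# Wait until the session is idle before submitting statements
while requests.get(
    f"{LIVY_URL}/sessions/{session_id}/state"
).json()["state"] != "idle":
    time.sleep(5)

# Each statement runs in the session's shared SparkSession
stmt = requests.post(
    f"{LIVY_URL}/sessions/{session_id}/statements",
    data=json.dumps({"code": "spark.range(100).count()"}),
    headers=HEADERS,
).json()
print(f"Submitted statement {stmt['id']} to session {session_id}")
```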
References

- livy/examples/pi_app
- rssanders3/livy_spark_operator_python_example
Upvotes: 3