Ramdev Sharma

Reputation: 1014

Submitting Spark jobs from Airflow via a batch POST to Livy, and tracking the job

I want to use Airflow to orchestrate jobs that include running some Pig scripts, shell scripts, and Spark jobs.

For the Spark jobs in particular, I want to use Apache Livy, but I am not sure whether that is a good idea or whether I should run spark-submit directly.

What is the best way to track a Spark job from Airflow once it has been submitted?

Upvotes: 4

Views: 4707

Answers (1)

y2k-shubham

Reputation: 11627

My assumption is that you have an application JAR containing Java / Scala code that you want to submit to a remote Spark cluster. Livy is arguably the best option for remote spark-submit when evaluated against the other possibilities (see the sketch after this list):

  • Specifying remote master IP: Requires modifying global configurations / environment variables
  • Using SSHOperator: SSH connection might break
  • Using EmrAddStepsOperator: Dependent on EMR
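
Here's a minimal sketch of submitting a batch to Livy from an Airflow PythonOperator callable, assuming a plain requests call to Livy's POST /batches endpoint. The Livy host, JAR path, and main class below are placeholders, not values from the question:

    import requests

    LIVY_URL = "http://livy-host:8998"  # placeholder Livy server address

    def submit_spark_batch(**context):
        # POST /batches is Livy's endpoint for spark-submit-style batch jobs
        payload = {
            "file": "s3://my-bucket/jars/my-spark-app.jar",  # placeholder application JAR
            "className": "com.example.MySparkApp",           # placeholder main class
            "args": ["--run-date", context["ds"]],
            "conf": {"spark.executor.memory": "4g"},
        }
        resp = requests.post(f"{LIVY_URL}/batches", json=payload)
        resp.raise_for_status()
        batch_id = resp.json()["id"]
        # Hand the batch id to downstream tasks via XCom for tracking
        context["ti"].xcom_push(key="livy_batch_id", value=batch_id)
        return batch_id

You can wrap submit_spark_batch in a PythonOperator (with provide_context=True on Airflow 1.x; Airflow 2.x passes the context automatically).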

Regarding tracking

  • Livy only reports state, not progress (% completion of stages)
  • If you're OK with that, you can just poll the Livy server via its REST API and keep printing logs to the console; those will appear in the task logs in the WebUI (View Logs). A polling sketch follows this list.
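
A sketch of such a polling loop, assuming Livy's GET /batches/{id}/state and GET /batches/{id}/log endpoints and a terminal-state set of success/dead/killed/error (the exact set depends on your Livy version):

    import time
    import requests

    LIVY_URL = "http://livy-host:8998"  # placeholder Livy server address
    TERMINAL_STATES = {"success", "dead", "killed", "error"}  # assumed terminal states

    def track_batch(batch_id, poll_interval=30):
        log_offset = 0
        while True:
            state = requests.get(f"{LIVY_URL}/batches/{batch_id}/state").json()["state"]
            # Fetch new driver log lines; anything printed to stdout here
            # shows up in the Airflow task log (WebUI -> View Logs)
            log = requests.get(
                f"{LIVY_URL}/batches/{batch_id}/log",
                params={"from": log_offset, "size": 100},
            ).json()
            for line in log.get("log", []):
                print(line)
            log_offset = log.get("from", log_offset) + len(log.get("log", []))
            if state in TERMINAL_STATES:
                return state
            time.sleep(poll_interval)

Failing the Airflow task whenever the returned state isn't "success" keeps the DAG's success/failure semantics honest.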

Other considerations

  • Livy doesn't support reusing a SparkSession across POST /batches requests
  • If that's imperative, you'll have to write your application code in PySpark and use POST /sessions requests instead; see the sketch after this list
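
A rough sketch of that session-based flow, using Livy's POST /sessions and POST /sessions/{id}/statements endpoints; the code snippets and polling intervals are illustrative only:

    import time
    import requests

    LIVY_URL = "http://livy-host:8998"  # placeholder Livy server address

    def run_in_shared_session(code_snippets):
        # One interactive PySpark session, reused for every snippet,
        # so they all share the same SparkSession
        session = requests.post(f"{LIVY_URL}/sessions", json={"kind": "pyspark"}).json()
        session_url = f"{LIVY_URL}/sessions/{session['id']}"
        while requests.get(session_url).json()["state"] != "idle":
            time.sleep(5)  # wait for the session to start
        try:
            for code in code_snippets:
                stmt = requests.post(f"{session_url}/statements", json={"code": code}).json()
                stmt_url = f"{session_url}/statements/{stmt['id']}"
                while requests.get(stmt_url).json()["state"] != "available":
                    time.sleep(5)  # wait for the statement to finish
                print(requests.get(stmt_url).json()["output"])
        finally:
            requests.delete(session_url)  # always close the session

    run_in_shared_session(["df = spark.range(100)", "print(df.count())"])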


Upvotes: 3
