Reputation: 1030
I am new to Spark and am trying to understand the performance difference between the approaches below (Spark on Hadoop).
Scenario: As part of batch processing, I have 50 Hive queries to run. Some can run in parallel and some must run sequentially.
- First approach
All of the queries can be stored in a Hive table, and I can write a Spark driver that reads them all at once and runs them in parallel (with HiveContext) using Java multithreading.
- Second approach
Use Oozie Spark actions to run each query individually.
I couldn't find any documentation on how Spark processes the queries internally in the first approach. From a performance point of view, which approach is better?
The only thing I could find on Spark multithreading is: "within each Spark application, multiple “jobs” (Spark actions) may be running concurrently if they were submitted by different threads"
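To make the first approach concrete, here is a minimal sketch of submitting each query from its own thread, as that quote describes. It assumes nothing about the real driver: `ParallelQueryRunner`, `runAll`, and the `runQuery` function are hypothetical names, and `runQuery` stands in for whatever actually executes a query (in a real driver it might wrap `hiveContext.sql(q)` followed by an action, which this sketch does not do).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelQueryRunner {

    /**
     * Submits every query from a pool thread and waits for all of them.
     * runQuery is a stand-in (assumption) for the real call that executes
     * a query, e.g. something wrapping hiveContext.sql(q) plus an action.
     * Results are returned in the same order as the input queries.
     */
    public static <R> List<R> runAll(List<String> queries,
                                     Function<String, R> runQuery,
                                     int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (String q : queries) {
                // Each task runs on a separate pool thread, so the jobs it
                // triggers can be scheduled concurrently by the application.
                Callable<R> task = () -> runQuery.apply(q);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> f : futures) {
                results.add(f.get()); // blocks until that query has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for queries", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A query failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Note that all threads share one Spark application, so the queries compete for the same executors; by default Spark schedules jobs FIFO, and `spark.scheduler.mode=FAIR` can be set if the concurrent jobs should share resources more evenly.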
Thanks in advance
Upvotes: 0
Views: 2403
Reputation: 11577
Since your requirement is to run Hive queries in parallel, with the condition
"Some can run parallel and some sequential",
this kind of workflow is best handled by a DAG processor, which is what Apache Oozie is. This approach will be cleaner than managing the queries in code, where you would effectively be building your own DAG processor instead of using the one Oozie provides.
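To illustrate what managing the ordering in code looks like, here is a minimal sketch of a hand-rolled three-node DAG: two queries run in parallel and a third runs only after both finish. The class name `HandRolledDag`, the query names, and the `runQuery` callback are hypothetical stand-ins, not part of Spark or Oozie. With 50 queries, this kind of wiring (plus retries, failure handling, and restarts) is what a workflow engine would otherwise provide.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

public class HandRolledDag {

    /**
     * Runs query_a and query_b concurrently, then query_c once both are done.
     * runQuery is a stand-in (assumption) for the code that executes a query.
     */
    public static void run(Consumer<String> runQuery) {
        // Independent queries: started on separate threads of the common pool.
        CompletableFuture<Void> a =
                CompletableFuture.runAsync(() -> runQuery.accept("query_a"));
        CompletableFuture<Void> b =
                CompletableFuture.runAsync(() -> runQuery.accept("query_b"));

        // Dependent query: scheduled only after both predecessors complete.
        CompletableFuture<Void> c =
                CompletableFuture.allOf(a, b)
                        .thenRun(() -> runQuery.accept("query_c"));

        c.join(); // wait for the whole graph; rethrows any failure unchecked
    }
}
```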
Upvotes: 0