Francis Drumm

Reputation: 87

Breaking SQL query up to improve Spark efficiency

I have a really large Hive SQL query that is to be migrated to Spark.
Are there performance savings to be made by dividing up the query?

Combined:

Dataset<Row> sqlDF = spark.sql("select c.name from order o join customer c on o.orderID=c.orderID where o.productPrice > 100");

Compared to:

Dataset<Row> order = spark.sql("select o.orderID from order o where o.productPrice > 100");
Dataset<Row> customer = spark.sql("select c.orderID, c.name from customer c");
Dataset<Row> joinedTable = order.join(customer, "orderID");

Using DataFrames will reduce the readability, but I'm not sure how Spark optimizes the request.
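One way to check, I suppose, is to compare the plans Catalyst produces for both forms with Dataset.explain(true); a minimal sketch, reusing the spark session and tables above:

// Minimal sketch: compare the plans Catalyst generates for both forms.
// Assumes the same SparkSession "spark" and tables as above.
Dataset<Row> combined = spark.sql("select c.name from order o join customer c on o.orderID=c.orderID where o.productPrice > 100");
combined.explain(true);   // prints parsed, analyzed, optimized and physical plans

Dataset<Row> filteredOrders = spark.sql("select o.orderID from order o where o.productPrice > 100");
Dataset<Row> customers = spark.sql("select c.orderID, c.name from customer c");
filteredOrders.join(customers, "orderID").select("name").explain(true);

If the optimized plans come out the same, splitting presumably buys nothing for this simple case.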

Upvotes: 3

Views: 209

Answers (1)

Ged

Reputation: 18003

Given that a query which never completes also counts as poor performance, the answer is yes.

I have seen Spark SQL with a 70-table join that failed with memory errors. Breaking it down into smaller sets with caching allowed it all to subsequently execute.
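As a rough sketch of the pattern (table and column names here are made up, not the actual 70-table job): materialize the smaller intermediate joins, cache them, then join the cached pieces:

// Hypothetical tables; the point is the pattern, not the schema.
Dataset<Row> orderItems = spark.sql(
    "select o.orderID, o.customerID, i.productID from orders o join order_items i on o.orderID = i.orderID");
orderItems.cache();   // materialize once, reuse below

Dataset<Row> customerRegions = spark.sql(
    "select c.customerID, c.name, r.regionName from customer c join region r on c.regionID = r.regionID");
customerRegions.cache();

Dataset<Row> result = orderItems
    .join(customerRegions, "customerID")
    .select("name", "regionName", "productID");

Each cached piece stays within memory limits, and the final join works on already-materialized data instead of one enormous plan.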

Catalyst is getting better, but in my opinion it is still behind the best-of-breed optimizers for traditional RDBMSs. It, or its successor, will catch up.

The same issues apply to Oracle as well.

I think your example will not show any advantage, but more complex queries will at times need to be split; that is decided on a case-by-case basis.

Upvotes: 2
