Reputation: 260
Suppose I have a graph G and the following query:
(?a)-x-(?b)-y-(?c)-z-(?d)-w-(?e)-q-(?f)-r-(?g)-s-(?h)
where {?a, ?b, ?c, ..., ?h} are variables, and {x, y, z, w, q, r, s} are arc labels.
At the storage level I have one table for each label, but also one for each combination of two consecutive labels. For instance, I have a table x with columns |a|b|, but also a table xy with columns |a|b|c|. Yes, I have redundant tables.
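For concreteness, here is a minimal sketch of how the single-label and two-label tables could be materialized as Spark DataFrames (the table and column names follow the example above; the rows are made up):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("path-query").getOrCreate()
import spark.implicits._

// Table for the single label x (arc a-b): columns |a|b|
val x = Seq((1, 2), (1, 3)).toDF("a", "b")

// Redundant table for the label pair xy (path a-b-c): columns |a|b|c|
val xy = Seq((1, 2, 4), (1, 3, 5)).toDF("a", "b", "c")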
Based on this setting I have two problems:
a) I need to find the tables such that joining them leads to the best (smallest) execution time. Let {xy, zw, q, rs} be those tables for the example above.
b) I have to execute the joins in some order, so I need to find that order, for instance: (rs ⨝ q) ⨝ (zw ⨝ xy), where ⨝ is a natural join.
Assuming I know which tables to use, i.e. that I have solved a), my question is how to tackle the second one. The Spark API allows me to execute all joins in a single line:
val res1 = xy.join(zw, Seq("c")).join(q, Seq("e")).join(rs, Seq("f"))
but I can also execute it in several lines:
val tmp1 = xy.join(zw, Seq("c"))
val tmp2 = q.join(rs, Seq("f"))
val res2 = tmp1.join(tmp2, Seq("e"))
The execution times of res1.count and res2.count (averaged over several runs) differ in my experiments, so the way the tree is built seems to have an impact on execution.
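For reference, a sketch of how such a measurement could be taken; timeAvg is a hypothetical helper, and the wall-clock numbers will vary with caching and JVM warm-up:
// Hypothetical helper: average wall-clock time of an action over n runs.
def timeAvg(n: Int)(action: => Unit): Double = {
  val secs = (1 to n).map { _ =>
    val t0 = System.nanoTime()
    action
    (System.nanoTime() - t0) / 1e9
  }
  secs.sum / n
}

val t1 = timeAvg(5)(res1.count())
val t2 = timeAvg(5)(res2.count())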
1) Which strategy can I use to build a tree that leads to the optimal execution time in Spark?
2) If each tree leads to different performance, what is the role of the query optimizer with respect to join ordering? It seems to be doing nothing, especially when I have all joins in a single line of code:
val res1 = xy.join(zw, Seq("c")).join(q, Seq("e")).join(rs, Seq("f"))
and
val res3 = rs.join(q, Seq("f")).join(zw, Seq("e")).join(xy, Seq("c"))
In one case I get a reasonable execution time; in the other, a timeout. Isn't Catalyst doing anything?
Upvotes: 0
Views: 88
Reputation: 74729
"The Spark API allows me to execute all joins in a single line [...] but I can also execute it in several lines"
Incorrect. Nothing is executed at this point; execution happens only when you invoke an action (such as count). What you showed are different ways of writing the same computation graph using high-level operators in Scala, and they create the same query plan.
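You can check this yourself by inspecting the plans Spark builds for res1 and res2; a minimal sketch using the standard queryExecution API:
// Compare the optimized logical plans of both variants.
println(res1.queryExecution.optimizedPlan)
println(res2.queryExecution.optimizedPlan)
If the two optimized plans print identically, any timing difference you measured comes from run-to-run variance rather than from the plan itself.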
"1) Which strategy can I use to build a tree that leads to the optimal execution time in Spark?"
That's the purpose of the Catalyst Optimizer, not you. You may want to explore the CostBasedJoinReorder logical optimization and the JoinSelection execution planning strategy, which are responsible for ensuring the best performance of joins.
JoinSelection execution planning strategy is used by SparkPlanner to plan a Join logical operator to one of the supported join physical operators.
CostBasedJoinReorder is a logical optimization for reordering joins in cost-based optimization.
If the sizes of the tables matter, consider cost-based optimization (CBO); you should see a difference. You have to use catalog tables (not arbitrary relations) and execute the ANALYZE TABLE ... COMPUTE STATISTICS command to collect the statistics.
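A minimal sketch of enabling CBO and collecting the statistics (spark.sql.cbo.* are the standard configuration keys; the table name xy follows the example above):
// Enable cost-based optimization and cost-based join reordering.
spark.conf.set("spark.sql.cbo.enabled", true)
spark.conf.set("spark.sql.cbo.joinReorder.enabled", true)

// CBO works on catalog tables with statistics, not on plain DataFrames.
xy.write.saveAsTable("xy")
spark.sql("ANALYZE TABLE xy COMPUTE STATISTICS")
With statistics in place, CostBasedJoinReorder can pick a join order based on estimated sizes instead of the order you wrote.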
"Isn't Catalyst doing anything?"
It should optimize joins. Use explain to inspect the query plans for more details.
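For example, explain(true) prints the full set of plans, so you can see whether the join order you wrote survived optimization:
// Prints the parsed, analyzed, optimized logical plans and the physical plan.
res1.explain(true)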
Upvotes: 1