user3198603

Reputation: 5836

Does Apache Spark make SQL queries faster?

From the article apache-spark-makes-slow-mysql-queries-10x-faster:

For long running (i.e., reporting or BI) queries, it can be much faster as Spark is a massively parallel system. MySQL can only use one CPU core per query, whereas Spark can use all cores on all cluster nodes. In my examples below, MySQL queries are executed inside Spark and run 5-10 times faster (on top of the same MySQL data).
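For context, "executed inside Spark" means the MySQL table is read into Spark through JDBC and split into parallel range scans. A minimal sketch of such a read follows; the connection URL, credentials, and partitioning bounds here are hypothetical, and the options shown are Spark's standard JDBC source options:

// Assumes a running SparkSession `spark` and a MySQL JDBC driver on the classpath.
val ordersDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://db-host:3306/shop") // hypothetical host/schema
  .option("dbtable", "orders")
  .option("user", "reporting")
  .option("password", sys.env("DB_PASSWORD"))
  // Split the read into 32 parallel range scans on customer_id
  .option("partitionColumn", "customer_id")
  .option("lowerBound", "1")
  .option("upperBound", "100000000")
  .option("numPartitions", "32")
  .load()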

It looks great, but I am not able to think of a practical example of a query that can be divided into sub-queries so that multiple cores make it faster, instead of running it on one core.

Upvotes: 1

Views: 1822

Answers (1)

rogue-one

Reputation: 11587

Let's consider two tables, Customers and Orders, each with 100 million records.

Now we have to join these two tables on the customer_id column to generate a report. This is close to impossible in MySQL, because a single system has to perform the join on a huge volume of data.

On a Spark cluster we can repartition these tables on the join column. The data of both DataFrames is then distributed by hashing customer_id, so all the rows for a single customer, from both the Customers and Orders tables, end up on the same Spark worker node, and each node can perform a local join, as shown in the snippet below.

import spark.implicits._ // enables the $"..." column syntax
val customerDf = ??? // DataFrame loaded from the Customers table
val orderDf = ???    // DataFrame loaded from the Orders table
// Hash-partition both DataFrames on the join key so all rows for a
// given customer_id land on the same worker node
val df1 = customerDf.repartition($"customer_id")
val df2 = orderDf.repartition($"customer_id")
val result = df1.join(df2, df1("customer_id") === df2("customer_id"))

So this 100-million-record join is now performed in parallel across tens or hundreds of worker nodes, as opposed to being done on a single node as in the case of MySQL.
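You can see this for yourself by inspecting the physical plan. With both sides hash-partitioned on the join key, the plan shows an Exchange hashpartitioning(customer_id, ...) stage feeding a distributed join operator, with one task per partition (the exact operator names, e.g. SortMergeJoin, vary by Spark version):

// Prints the physical plan for the join above
result.explain()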

Upvotes: 1
