Reputation: 309
I am trying to calculate the mean of a certain column and save it as a new column. Here is the code snippet I use to achieve this:
df = df.withColumn("avg_colname", lit(df.select(avg("colname").as("temp")).first().getAs("temp")))
In total, there are 8 columns to be calculated. On a small 3-node cluster using the "spark-submit" command, the code execution takes much more time than on a single machine using the "spark-shell" command (several minutes vs. a few seconds).
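For illustration, repeating the same pattern for the other columns looks roughly like this (the column names are placeholders, not my real ones):
import org.apache.spark.sql.functions.{avg, lit}

// each iteration calls .first(), so every column triggers its own Spark job
val columns = Seq("col_1", "col_2", "col_3") // ... 8 columns in total
var result = df
for (c <- columns) {
  val mean = result.select(avg(c).as("temp")).first().getAs[Double]("temp")
  result = result.withColumn(s"avg_$c", lit(mean))
}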
Why does the code execute more slowly on a cluster than on a single machine, and how can the code snippet above be improved?
Upvotes: 1
Views: 182
Reputation: 792
I agree with Raphael on keeping it to 1 machine and 1 partition if the data is small. In addition, I wanted to add my answer to give you some more insight into what is happening under the hood.
The code you've asked about has the explain plan as shown below (note: I'm only showing the calculation of the average; the addition of the column is a lazy transformation, so it won't trigger an actual computation):
scala> df.select(avg("colname").as("temp")).explain
== Physical Plan ==
*HashAggregate(keys=[], functions=[avg(cast(col_2#6 as bigint))])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_avg(cast(col_2#6 as bigint))])
      +- LocalTableScan [col_2#6]
As shown in the plan above, an exchange takes place to calculate the average value.
This answers the question of why you see a slowdown when going to a 3-node cluster: the exchange results in a network shuffle of data across partitions, and network operations are much slower than in-memory or cache operations.
As for your second question, how to improve it will depend on the details of your application. You can't really change much in the code, since you still need to calculate the average. But there are a couple of things you can look at, and the one I would recommend is the one Raphael also suggests: if your data is small, keep it on 1 machine and in 1 partition. It is generally low-hanging fruit in such situations and will be much simpler to achieve if you know your data and its use case.
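A minimal sketch of that recommendation, assuming the data comfortably fits in a single partition ("colname" as in the question):
import org.apache.spark.sql.functions.avg

// collapse to a single partition first (only sensible for small data), so the whole
// aggregation runs on one node and the exchange has essentially nothing left to move
val avgValue = df.coalesce(1)
  .select(avg("colname").as("temp"))
  .first()
  .getAs[Double]("temp")
// alternatively, keep everything on one machine by submitting with --master local[*]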
Upvotes: 1
Reputation: 27383
You could rewrite your code using window functions:
df = df.withColumn("avg_colname", avg("colname").over())
// add other columns
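For the remaining columns, the same empty window can simply be reused; a rough sketch with placeholder column names:
import org.apache.spark.sql.functions.avg

// an empty over() averages over the whole DataFrame, but note that it moves
// all rows into a single partition to do so
df = df
  .withColumn("avg_col_1", avg("col_1").over())
  .withColumn("avg_col_2", avg("col_2").over())
// ... repeat for the remaining columns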
Or alternatively, join in the averages:
df = df.crossJoin(broadcast(
df.agg(
avg("colname").as("avg_colname")
// add other columns
)
))
The two approaches should give the same result, but they work in different ways: the window function will move all data to 1 partition, while the second approach uses partial aggregation and will scale better for big datasets.
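As a rough sketch with placeholder column names, the second approach for all columns stays a single aggregation followed by one broadcast join:
import org.apache.spark.sql.functions.{avg, broadcast}

// one pass over the data computes every average at once
val avgs = df.agg(
  avg("col_1").as("avg_col_1"),
  avg("col_2").as("avg_col_2")
  // ... one avg per column, 8 in total
)
df = df.crossJoin(broadcast(avgs))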
You may also try to cache the initial dataframe and check whether this helps. Also note that if your data is small, distributed computing only adds overhead and makes it slower. If you know the data is small, it is probably best to use 1 machine and 1 partition...
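A minimal sketch of the caching suggestion, assuming the same dataframe is reused for all eight averages:
// cache the dataframe so repeated aggregations don't recompute or re-read the source;
// an action such as count() materializes the cache
df.cache()
df.count()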
Upvotes: 3