mehere
mehere

Reputation: 1556

Spark Scala FoldLeft Performance slowness

Hi I am trying to do a scdtype2 update in dataframe having 280 columns.

val newYRecs = stgDF.columns
                .foldLeft(joinedDF)
                  {(tempDF,colName) => 
                      tempDF.withColumn("new_" + colName, when(col("stg." + colName).isNull, col("tgt."+ colName)).otherwise(col("stg."  + colName))).drop(col("stg." + colName)).drop(col("tgt." + colName)).withColumnRenamed("new_" + colName,colName) 

This is taking 8 minutes alone to execute. Is there any way this can be optimized?

Upvotes: 1

Views: 559

Answers (1)

krezno
krezno

Reputation: 500

According to this article, it seems that withColumn has a hidden cost of the Catalyst optimizer that hampers performance when used on many columns. I would try using the proposed workaround and doing something like this (Also while you're at it, you can make your code cleaner with coalesce):

val newYRecs = joinedDF.select(stgDF.columns.map{ colName =>
      coalesce(col("stg." + colName), col("tgt."+ colName)) as colName
}: _*)

Upvotes: 2

Related Questions