How can I optimize the spark function to replace nulls with zeroes?

Question

Below is my Spark Function which handles nulls in a DataFrame column irrespective of its datatype.

  def nullsToZero(df:DataFrame,nullsToZeroColsList:Array[String]): DataFrame ={
    var y:DataFrame = df
    for(colDF <- y.columns){
      if(nullsToZeroColsList.contains(colDF)){
        y = y.withColumn(colDF,expr("case when "+colDF+" IS NULL THEN 0 ELSE "+colDF+" end"))
      }
    }
    return y
  }

    import spark.implicits._
    val personDF = Seq(
      ("miguel", Some(12),100,110,120), (null, Some(22),200,210,220), ("blu", None,300,310,320)
    ).toDF("name", "age","number1","number2","number3")
    println("Print Schema")
    personDF.printSchema()
    println("Show Original DF")
    personDF.show(false)
    val myColsList:Array[String] = Array("name","age","age")
    println("NULLS TO ZERO")
    println("Show NullsToZeroDF")
    val fixedDF = nullsToZero(personDF,myColsList)

In the above code I've a Integer type and a String type datatypes, both were handled by my funciton. But I suspect the below piece of code, in my function might affect the performance but not sure.

y = y.withColumn(colDF,expr("case when "+colDF+" IS NULL THEN 0 ELSE "+colDF+" end"))

Is there any more optimized way I can write this function, and what is the significance of doing .withColumn() and reassigning a DF again and again? Thank you in Advance.

How can I optimize the spark function to replace nulls with zeroes?

Answers (1)

Related Questions