Merve Bozo

Reputation: 439

How can I improve the performance when completing a table with statistical methods in Apache-Spark?

I have a dataset with 10 fields and 5,000 rows. I want to complete this dataset with some statistical methods in Spark with Scala: I fill the empty cells of a field with the mean of that field if it holds continuous values, and with the most frequent value if it holds discrete values. Here is my code:

for (col <- cols) {
  val datacount = table.select(col).rdd.map(r => r(0)).filter(_ == null).count()

  if (datacount > 0) {
    if (continuous_lst contains col) {            // put mean of data to null values
      val avg = table.select(mean(col)).first()(0).asInstanceOf[Double]
      df = df.na.fill(avg, Seq(col))
    }
    else if (discrete_lst contains col) {         // put most frequent categorical value to null values
      val group_df = df.groupBy(col).count()
      val sorted = group_df.orderBy(desc("count")).take(1)

      val most_frequent = sorted.map(t => t(0))
      val most_frequent_ = most_frequent(0).toString.toDouble.toInt

      val type__ = ctype.filter(t => t._1 == col)
      val type_ = type__.map(t => t._2)

      df = df.na.fill(most_frequent_, Seq(col))
    }
  }
}

The problem is that this code runs very slowly on this data. I use spark-submit with an executor memory of 8 GB, and I call repartition(4) on the data before passing it to this function.
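
For reference, the setup described above corresponds roughly to the following sketch; fillTable is just a placeholder name for the function that contains the loop shown earlier:

// submitted with: spark-submit --executor-memory 8G ...
val repartitioned = table.repartition(4)   // 4 partitions before the fill step
val filled = fillTable(repartitioned)      // hypothetical wrapper around the for loop above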

I will need to work with larger datasets, so how can I speed up this code?

Thanks for your help.

Upvotes: 1

Views: 191

Answers (1)

Daniel de Paula

Reputation: 17872

Here is a suggestion:

import org.apache.spark.sql.{Column, DataFrame, Row}
import org.apache.spark.sql.functions._

def most_frequent(df: DataFrame, col: Column) = {
  df.select(col).rdd                         // work on the underlying RDD so reduceByKey is available
    .map { case Row(colVal) => (colVal, 1) }
    .reduceByKey(_ + _)
    .reduce { case ((val1, cnt1), (val2, cnt2)) => if (cnt1 > cnt2) (val1, cnt1) else (val2, cnt2) }
    ._1
}

val new_continuous_cols = continuous_lst.map {
  col => coalesce(col, mean(col)).as(col.toString)
}.toArray

val new_discrete_cols = discrete_lst.map {
  col => coalesce(col, lit(most_frequent(table, col))).as(col.toString)
}.toArray

val all_new_cols = new_continuous_cols ++ new_discrete_cols
val newDF = table.select(all_new_cols: _*)

Considerations:

  • I assumed that continuous_lst and discrete_lst are lists of Column. If they are lists of String, the idea is the same, but some adjustments would be necessary;
  • Note that I used map and reduce to calculate the most frequent value of a column. That can be better than grouping by and aggregating in some cases. (Maybe there is room for improvement here, by calculating the most frequent values for all discrete columns at once? See the sketch after this list.);
  • Additionally, I used coalesce to replace the null values, instead of fill. This may result in some improvement as well. (More info about the coalesce function in the scaladoc API);
  • I cannot test this at the moment, so there may be something missing that I didn't see.
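
To make the first two points concrete, here is a rough, untested sketch; the column names and the mostFrequentAll helper are illustrative placeholders, not part of the suggestion above. It assumes continuous_lst and discrete_lst are lists of Column and shows how the most frequent value of every discrete column could be computed in a single pass over the data:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Illustrative inputs (assumption: lists of Column, as stated above)
val continuous_lst = Seq(col("age"), col("income"))
val discrete_lst   = Seq(col("gender"), col("city"))

// One pass over the data: emit ((columnName, value), 1) for every non-null cell,
// sum the counts per (column, value) pair, then keep the best value per column.
def mostFrequentAll(df: DataFrame, colNames: Seq[String]): Map[String, Any] = {
  df.select(colNames.map(col): _*).rdd
    .flatMap { row =>
      colNames.zipWithIndex.collect {
        case (c, i) if !row.isNullAt(i) => ((c, row.get(i)), 1L)
      }
    }
    .reduceByKey(_ + _)
    .map { case ((c, value), cnt) => (c, (value, cnt)) }
    .reduceByKey((a, b) => if (a._2 >= b._2) a else b)
    .mapValues(_._1)
    .collectAsMap()
    .toMap
}

// The precomputed modes could then replace the per-column most_frequent calls:
val modes = mostFrequentAll(table, discrete_lst.map(_.toString))
val new_discrete_cols = discrete_lst.map { c =>
  coalesce(c, lit(modes(c.toString))).as(c.toString)
}.toArray

This way all discrete columns are covered by a single Spark job instead of one job per column.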

Upvotes: 2
