HHH

Reputation: 6465

How to find columns with many nulls in Spark/Scala

I have a DataFrame in Spark/Scala that has hundreds of columns. Many of the columns contain a lot of null values. I'd like to find the columns that are more than 90% null and then drop them from my DataFrame. How can I do that in Spark/Scala?

Upvotes: 0

Views: 2851

Answers (2)

emesday

Reputation: 6186

org.apache.spark.sql.functions.array and a udf will help.

import spark.implicits._
import org.apache.spark.sql.functions._

// the explicit type parameter is needed so the null literals are typed as String
val df = sc.parallelize[(String, String, String, String, String, String, String, String, String, String)](
  Seq(
    ("a", null, null, null, null, null, null, null, null, null), // 90%
    ("b", null, null, null, null, null, null, null, null, ""), // 80%
    ("c", null, null, null, null, null, null, null, "", "") // 70%
  )
).toDF("c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10")

// true if at least 90% of the values in a row are null
val check_90_null = udf { xs: Seq[String] =>
  xs.count(_ == null) >= (xs.length * 0.9)
}

// all columns as array
val columns = array(df.columns.map(col): _*)

// drop the rows that are at least 90% null
df.where(not(check_90_null(columns)))
  .show()

which shows:

+---+----+----+----+----+----+----+----+----+---+
| c1|  c2|  c3|  c4|  c5|  c6|  c7|  c8|  c9|c10|
+---+----+----+----+----+----+----+----+----+---+
|  b|null|null|null|null|null|null|null|null|   |
|  c|null|null|null|null|null|null|null|    |   |
+---+----+----+----+----+----+----+----+----+---+

The row starting with "a" is excluded, since 9 of its 10 values are null. (Note that this filters out rows that are mostly null rather than dropping columns.)
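For reference, the same row filter can be written without a udf, using only built-in column functions so Catalyst can optimize the expression. This is a sketch along the same lines, not part of the original answer:

import org.apache.spark.sql.functions._

// build a per-row null count by summing one 0/1 indicator per source column
val nullCount = df.columns
  .map(c => when(col(c).isNull, 1).otherwise(0))
  .reduce(_ + _)

// keep rows that are less than 90% null
df.where(nullCount < lit(df.columns.length * 0.9)).show()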

Upvotes: 2

akuiper

Reputation: 214937

Suppose you have a data frame like this:

// use None (rather than null) so the Option element types are inferred cleanly
val df = Seq((Some(1.0), Some(2), Some("a")),
             (None, Some(3), None),
             (Some(2.0), Some(4), Some("b")),
             (None, None, Some("c"))
            ).toDF("A", "B", "C")

df.show
+----+----+----+
|   A|   B|   C|
+----+----+----+
| 1.0|   2|   a|
|null|   3|null|
| 2.0|   4|   b|
|null|null|   c|
+----+----+----+

Count the nulls in each column using the agg function, then keep only the columns whose null count is at or below a threshold (set to 1 here):

val null_thresh = 1                 // absolute null count; for a 90% threshold,
                                    // use val null_thresh = df.count() * 0.9

val to_keep = df.columns.filter(
    c => df.agg(
        sum(when(df(c).isNull, 1).otherwise(0)).alias(c)
    ).first().getLong(0) <= null_thresh
)

df.select(to_keep.head, to_keep.tail: _*).show

And you get:

+----+----+
|   B|   C|
+----+----+
|   2|   a|
|   3|null|
|   4|   b|
|null|   c|
+----+----+
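
One caveat worth noting (an addition, not part of the original answer): the filter above launches a separate Spark job for each column. All the null counts can be gathered in a single pass instead, for example:

// compute every column's null count in one aggregation
val nullCounts = df.select(
  df.columns.map(c => sum(when(col(c).isNull, 1).otherwise(0)).alias(c)): _*
).first()

val to_keep = df.columns.filter(c => nullCounts.getAs[Long](c) <= null_thresh)
df.select(to_keep.head, to_keep.tail: _*).show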

Upvotes: 2
