Meyer Cohen
Meyer Cohen

Reputation: 350

Scala - Spark : Get the column names of the columns that contains null values

Here's the situation : I've got a DataFrame where I want to get the column names of the columns that contains one or more null values in them.

So far what I've done :

val columnsContainsNull = df.select(mandatoryColumns: _*)
   .columns
   .map(column => if(df.select(column).filter(col(column).isNull).count > 0) column)

When I execute that code, it becomes incredibly slow for a reason I don't know. Do you have any clue how can I make it work and how can I optimize that please ?

Upvotes: 2

Views: 1262

Answers (1)

philantrovert
philantrovert

Reputation: 10082

In your code, you will executing one action for every column, which can cause the execution to slow down and even more so in case of wider data.

You can calculate using the following method

Register your DF as a table using createOrReplaceTempView

df.createOrReplaceTempView("TEMP_TABLE")

Then execute the following SQL statement on it.

SELECT
  SUM(case when col1 is null then 1 else 0 end) as col1_isnull,
  SUM(case when col2 is null then 1 else 0 end) as col2_isnull,
  ....
  ....
  SUM(case when coln is null then 1 else 0 end) as coln_isnull
from
  TEMP_TABLE

If you have a lot of columns, you can also generate the statement programatically using:

val query = df2.columns.map{ c => 
   s"sum(case when $c is null then 1 else 0 end) as ${c}"
}.mkString("SELECT ", ",", " FROM TEMP_TABLE")

Then once you have executed the query using

val nullCounts = spark.sql(query)

You should have a dataframe that looks like:

+----+----+----+----+
|col1|col2|....|colN|     
+----+----+----+----+
|   0|   1|    |   1|
+----+----+----+----+

Then, you can extract the column names that have null values using the following:

val paired = nullCounts.first
nullCounts.columns
          .zipWithIndex.map{ 
              case(k,v) => (k, paired.getLong(v)) 
          }
          .filter(_._2 > 0 )
          .map(_._1)

And you should have an Array[String] which will have the column names that have NULL value in them.

// Array[String] = Array(col2)

Upvotes: 4

Related Questions