Reputation: 350
Here's the situation: I've got a DataFrame, and I want to get the names of the columns that contain one or more null values.
So far, this is what I've done:
val columnsContainsNull = df.select(mandatoryColumns: _*)
  .columns
  .filter(column => df.select(column).filter(col(column).isNull).count > 0)
When I execute that code, it becomes incredibly slow for a reason I don't understand. Do you have any clue how I can make it work, and how I can optimize it?
Upvotes: 2
Views: 1262
Reputation: 10082
In your code, you are executing one Spark action (a count) for every column, which slows the execution down, and even more so with wide data.
You can instead compute all of the null counts in a single query, as follows.
Register your DF as a table using createOrReplaceTempView
df.createOrReplaceTempView("TEMP_TABLE")
Then execute the following SQL statement on it; all of the per-column counts are computed in a single scan of the data.
SELECT
  SUM(CASE WHEN col1 IS NULL THEN 1 ELSE 0 END) AS col1_isnull,
  SUM(CASE WHEN col2 IS NULL THEN 1 ELSE 0 END) AS col2_isnull,
  ....
  SUM(CASE WHEN coln IS NULL THEN 1 ELSE 0 END) AS coln_isnull
FROM TEMP_TABLE
If you have a lot of columns, you can also generate the statement programmatically:
val query = df.columns.map { c =>
  s"sum(case when $c is null then 1 else 0 end) as $c"
}.mkString("SELECT ", ", ", " FROM TEMP_TABLE")
Then, once you have executed the query using
val nullCounts = spark.sql(query)
you should have a DataFrame that looks like this:
+----+----+----+----+
|col1|col2|....|colN|
+----+----+----+----+
| 0| 1| | 1|
+----+----+----+----+
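As an aside (a sketch of mine, not from the original answer), you can produce an equivalent nullCounts DataFrame without registering a temp view, using the DataFrame API directly. Here count(when(cond, x)) only counts the rows where the condition holds, because when returns null otherwise:
import org.apache.spark.sql.functions.{col, count, when}

// one aggregation row: a null count per column, computed in a single scan
val nullCounts = df.select(
  df.columns.map(c => count(when(col(c).isNull, c)).alias(c)): _*
)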
Then, you can extract the column names that have null values (calling first below is the single action that actually runs the query):
val paired = nullCounts.first
nullCounts.columns
  .zipWithIndex
  .map { case (k, v) => (k, paired.getLong(v)) }
  .filter(_._2 > 0)
  .map(_._1)
You should then end up with an Array[String] containing the names of the columns that have NULL values in them:
// Array[String] = Array(col2)
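For completeness, here is a self-contained sketch of the whole approach on a throwaway DataFrame (the session setup and data are made up purely for illustration):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("null-columns").getOrCreate()
import spark.implicits._

// col2 is the only column containing a null
val df = Seq(
  (Option(1), Option.empty[Int], Option("a")),
  (Option(2), Option(5),         Option("b"))
).toDF("col1", "col2", "col3")

df.createOrReplaceTempView("TEMP_TABLE")

val query = df.columns.map { c =>
  s"sum(case when $c is null then 1 else 0 end) as $c"
}.mkString("SELECT ", ", ", " FROM TEMP_TABLE")

val nullCounts = spark.sql(query)
val paired = nullCounts.first

val columnsWithNulls = nullCounts.columns
  .zipWithIndex
  .map { case (k, v) => (k, paired.getLong(v)) }
  .filter(_._2 > 0)
  .map(_._1)
// columnsWithNulls: Array[String] = Array(col2)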
Upvotes: 4