too_many_questions

Reputation: 553

Find all nulls with SQL query over pyspark dataframe

I have a dataframe of StructFields with a mixed schema (DoubleType, StringType, LongType, etc.).

I want to 'iterate' over all columns to return summary statistics. For instance:

set_min = df.select([
        fn.min(df[c]).alias(c) for c in df.columns
    ]).collect()

This is what I'm using to find the minimum value in each column, and it works fine. But when I try something similar to find nulls:

set_null = df.filter(
       (lambda x: self.df[x]).isNull().count()
).collect()

I get TypeError: condition should be string or Column, which makes sense, since I'm passing a function.

Or with a list comprehension:

set_null = self.df[c].alias(c).isNull() for c in self.df.columns

Then I try passing it a SQL query as a string:

set_null = df.filter('SELECT fields FROM table WHERE column = NUL').collect()

I get:

ParseException: "\nmismatched input 'FROM' expecting <EOF>(line 1, pos 14)\n\n== SQL ==\nSELECT fields FROM table WHERE column = NULL\n--------------^^^\n"

How can I pass my function as a 'string or Column' so I can use filter or where? Alternatively, why won't the pure SQL statement work?

Upvotes: 1

Views: 912

Answers (2)

sgvd

Reputation: 3939

There are things wrong in several parts of your attempts:

  • You are missing square brackets in your list comprehension example
  • You missed an L in NUL
  • Your pure SQL doesn't work because filter/where expects a WHERE clause, not a full SQL statement; the two are just aliases, and I prefer where because it makes it clearer that you only need to supply such a clause (see the sketch right after this list)
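
For instance (a minimal sketch, assuming df is your DataFrame and c is one of its column names), either of these counts the NULL rows of a single column, because where accepts a Column expression or just the text of a WHERE clause, not a full SELECT statement:

from pyspark.sql import functions as fn

# Column expression: build the predicate with isNull()
nulls_in_c = df.where(fn.col(c).isNull()).count()

# WHERE-clause string: only the condition, no SELECT ... FROM
nulls_in_c = df.where('{} IS NULL'.format(c)).count()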

In the end you don't need where at all, as karlson also shows. But subtracting the non-null count from the total count means you have to evaluate the dataframe twice (which can be alleviated by caching, but is still not ideal). There is a more direct way:

>>> df.select([fn.sum(fn.isnull(c).cast('int')).alias(c) for c in df.columns]).show()
+---+---+
|  A|  B|
+---+---+
|  2|  3|
+---+---+

This works because casting a boolean value to integer gives 1 for True and 0 for False. If you prefer SQL, the equivalent is:

df.select([fn.expr('SUM(CAST(({c} IS NULL) AS INT)) AS {c}'.format(c=c)) for c in df.columns]).show()

or nicer, without a cast:

df.select([fn.expr('SUM(IF({c} IS NULL, 1, 0)) AS {c}'.format(c=c)) for c in df.columns]).show()
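
If you want the counts back in Python rather than just printed, a small sketch (the helper name null_counts is made up here, not part of the original answer) could collect the single result row into a dict:

from pyspark.sql import functions as fn

def null_counts(df):
    # One aggregation pass: each column is replaced by the number of NULLs it holds.
    row = df.select(
        [fn.sum(fn.isnull(c).cast('int')).alias(c) for c in df.columns]
    ).collect()[0]
    return row.asDict()

print(null_counts(df))  # e.g. {'A': 2, 'B': 3} for the example data below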

Upvotes: 2

karlson

Reputation: 5433

If you want a count of NULL values per column, you could count the non-null values and subtract from the total.

For example:

from pyspark.sql import SparkSession
from pyspark.sql import functions as fn

spark = SparkSession.builder.master("local").getOrCreate()


df = spark.createDataFrame(
    data=[
        (1, None),
        (1, 1),
        (None, None),
        (1, 1),
        (None, 1),
        (1, None),
    ],
    schema=("A", "B")
)

total = df.count()
missing_counts = df.select(
    *[(total - fn.count(col)).alias("missing(%s)" % col) for col in df.columns]
)

missing_counts.show()
>>> +----------+----------+
... |missing(A)|missing(B)|
... +----------+----------+
... |         2|         3|
... +----------+----------+

Upvotes: 0
