too_many_questions

Reputation: 553

Find all nulls with SQL query over pyspark dataframe

I have a dataframe of StructFields with a mixed schema (DoubleType, StringType, LongType, etc.).

I want to 'iterate' over all columns to return summary statistics. For instance:

set_min = df.select([
        fn.min(df[c]).alias(c) for c in df.columns
    ]).collect()

This is what I'm using to find the minimum value in each column, and it works fine. But when I try something similar to find nulls:

set_null = df.filter(
       (lambda x: self.df[x]).isNull().count()
).collect()

I get TypeError: condition should be string or Column, which makes sense, since I'm passing a function.

Or with a list comprehension:

set_null = self.df[c].alias(c).isNull() for c in self.df.columns

Then I try passing it a SQL query as a string:

set_null = df.filter('SELECT fields FROM table WHERE column = NUL').collect()

I get:

ParseException: "\nmismatched input 'FROM' expecting <EOF>(line 1, pos 14)\n\n== SQL ==\nSELECT fields FROM table WHERE column = NULL\n--------------^^^\n"

How can I pass my function as a 'string or Column' so I can use filter or where? Alternatively, why won't the pure SQL statement work?

Upvotes: 1

Views: 912

Answers (2)

sgvd

Reputation: 3939

There are things wrong in several parts of your attempts:

  • You are missing square brackets in your list comprehension example
  • You missed an L in NUL
  • Your pure SQL doesn't work because filter/where expects a WHERE clause, not a full SQL statement; the two are just aliases, and I prefer where because it makes it clearer that you only need to supply such a clause (see the sketch right after this list)
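
For instance (a minimal sketch, assuming df is your DataFrame and c is one of its column names), either of these counts the NULL rows of a single column, because where accepts a Column expression or just the text of a WHERE clause, not a full SELECT statement:

from pyspark.sql import functions as fn

# Column expression: build the predicate with isNull()
nulls_in_c = df.where(fn.col(c).isNull()).count()

# WHERE-clause string: only the condition, no SELECT ... FROM
nulls_in_c = df.where('{} IS NULL'.format(c)).count()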

In the end you don't need where at all, as karlson also shows. But subtracting the non-null count from the total count means you have to evaluate the dataframe twice (which can be alleviated by caching, but is still not ideal). There is a more direct way:

>>> df.select([fn.sum(fn.isnull(c).cast('int')).alias(c) for c in df.columns]).show()
+---+---+
|  A|  B|
+---+---+
|  2|  3|
+---+---+

This works because casting a boolean value to integer gives 1 for True and 0 for False. If you prefer SQL, the equivalent is:

df.select([fn.expr('SUM(CAST(({c} IS NULL) AS INT)) AS {c}'.format(c=c)) for c in df.columns]).show()

or nicer, without a cast:

df.select([fn.expr('SUM(IF({c} IS NULL, 1, 0)) AS {c}'.format(c=c)) for c in df.columns]).show()
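
If you want the counts back in Python rather than just printed, a small sketch (the helper name null_counts is made up here, not part of the original answer) could collect the single result row into a dict:

from pyspark.sql import functions as fn

def null_counts(df):
    # One aggregation pass: each column is replaced by the number of NULLs it holds.
    row = df.select(
        [fn.sum(fn.isnull(c).cast('int')).alias(c) for c in df.columns]
    ).collect()[0]
    return row.asDict()

print(null_counts(df))  # e.g. {'A': 2, 'B': 3} for the example data below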

Upvotes: 2

karlson

Reputation: 5433

If you want a count of NULL values per column, you could count the non-null values and subtract from the total.

For example:

from pyspark.sql import SparkSession
from pyspark.sql import functions as fn

spark = SparkSession.builder.master("local").getOrCreate()


df = spark.createDataFrame(
    data=[
        (1, None),
        (1, 1),
        (None, None),
        (1, 1),
        (None, 1),
        (1, None),
    ],
    schema=("A", "B")
)

total = df.count()
missing_counts = df.select(
    *[(total - fn.count(col)).alias("missing(%s)" % col) for col in df.columns]
)

missing_counts.show()
>>> +----------+----------+
... |missing(A)|missing(B)|
... +----------+----------+
... |         2|         3|
... +----------+----------+

Upvotes: 0
