Reputation: 553
I have a DataFrame with a mixed schema of StructField types (DoubleType, StringType, LongType, etc.).
I want to 'iterate' over all columns to return summary statistics. For instance:
set_min = df.select(
    [fn.min(df[c]).alias(c) for c in df.columns]
).collect()
Is what I'm using to find the minimum value in each column. That works fine. But when I try something similar, designed to find nulls:
set_null = df.filter(
    (lambda x: df[x]).isNull().count()
).collect()
I get TypeError: condition should be string or Column, which makes sense: I'm passing a function.
Or with a list comprehension:
set_null = [df[c].alias(c).isNull() for c in df.columns]
Then I try passing it a SQL query as a string:
set_null = df.filter('SELECT fields FROM table WHERE column = NUL').collect()
I get:
ParseException: mismatched input 'FROM' expecting <EOF>(line 1, pos 14)

== SQL ==
SELECT fields FROM table WHERE column = NULL
--------------^^^
How can I pass my function as a 'string or Column' so I can use filter or where?
Alternatively, why won't the pure SQL statement work?
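For reference, a minimal setup that stands in for my real mixed-schema DataFrame (the column names d, s, l and the sample rows are just placeholders):

from pyspark.sql import SparkSession
from pyspark.sql import functions as fn

spark = SparkSession.builder.master("local").getOrCreate()

# Toy stand-in for the real data: DoubleType, StringType, LongType columns.
df = spark.createDataFrame(
    [(1.0, "a", 1), (None, None, None), (2.5, "b", 3)],
    schema=("d", "s", "l"),
)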
Upvotes: 1
Views: 912
Reputation: 3939
There are things wrong in several parts of your attempts:

- NUL should be NULL, and in any case column = NULL never matches anything in SQL; you test for missing values with column IS NULL.
- filter/where expects a where clause, not a full SQL statement. They are just aliases, and I prefer to use where so it is clearer you just need to give such a clause, e.g. df.where('A IS NULL').
- In the end you don't need to use where, like karlson also shows. But subtracting from the total count means you have to evaluate the dataframe twice (which can be alleviated by caching, but is still not ideal). There is a more direct way:
>>> df.select([fn.sum(fn.isnull(c).cast('int')).alias(c) for c in df.columns]).show()
+---+---+
|  A|  B|
+---+---+
|  2|  3|
+---+---+
This works because casting a boolean value to an integer gives 1 for True and 0 for False. If you prefer SQL, the equivalent is:
df.select([fn.expr('SUM(CAST(({c} IS NULL) AS INT)) AS {c}'.format(c=c)) for c in df.columns]).show()
or nicer, without a cast:
df.select([fn.expr('SUM(IF({c} IS NULL, 1, 0)) AS {c}'.format(c=c)) for c in df.columns]).show()
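For completeness, here is a self-contained run of the direct approach. This is a minimal sketch: the imports and the toy DataFrame are borrowed from karlson's answer below, not part of the original snippet.

from pyspark.sql import SparkSession
from pyspark.sql import functions as fn

spark = SparkSession.builder.master("local").getOrCreate()

# Toy data borrowed from karlson's answer below.
df = spark.createDataFrame(
    data=[(1, None), (1, 1), (None, None), (1, 1), (None, 1), (1, None)],
    schema=("A", "B"),
)

# One pass over the data: per column, cast "is null" to 0/1 and sum.
df.select(
    [fn.sum(fn.isnull(c).cast("int")).alias(c) for c in df.columns]
).show()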
Upvotes: 2
Reputation: 5433
If you want a count of NULL values per column, you can count the non-null values and subtract from the total.
For example:
from pyspark.sql import SparkSession
from pyspark.sql import functions as fn
spark = SparkSession.builder.master("local").getOrCreate()
df = spark.createDataFrame(
    data=[
        (1, None),
        (1, 1),
        (None, None),
        (1, 1),
        (None, 1),
        (1, None),
    ],
    schema=("A", "B")
)
total = df.count()
# fn.count() counts only non-null values, so subtracting it from the
# total row count gives the number of NULLs in each column.
missing_counts = df.select(
    *[(total - fn.count(col)).alias("missing(%s)" % col) for col in df.columns]
)
missing_counts.show()
+----------+----------+
|missing(A)|missing(B)|
+----------+----------+
|         2|         3|
+----------+----------+
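If evaluating the DataFrame twice (once for the count, once for the aggregation) is a concern, as the other answer notes, caching avoids recomputing it from the source. A minimal sketch of that variant, using the same df and fn as above:

# Cache so both actions reuse the same materialized data.
df.cache()
total = df.count()  # first action, fills the cache
df.select(
    *[(total - fn.count(col)).alias("missing(%s)" % col) for col in df.columns]
).show()  # second action, served from the cache
df.unpersist()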
Upvotes: 0