Reputation: 7957
I have a dataframe and need to see if it contains null values. There are plenty of posts on the same topic, but nearly all of them use the count action or the show method. count operations are prohibitively expensive in my case as the data volume is large, and the same goes for the show method.
Is there a way in which I can ask spark to look for null values and raise an error as soon as it encounters the first null value?
The solutions in other posts give the count of missing values in each column. I don't need to know the number of missing values in every column. I just want to know if there is any cell in the dataframe with a null value.
Upvotes: 4
Views: 7840
Reputation: 28322
You potentially have to go through all values and check for nulls. This can be done by traversing the dataframe either row-wise or column-wise; which one is faster depends on the data, so use heuristics.
Row-wise traversal:
import pyspark.sql.functions as f
from functools import reduce

# keep rows where any column is null; limit(1) stops at the first match
any_null_row = df.where(reduce(lambda a, b: a | b, (f.col(c).isNull() for c in df.columns))).limit(1)
contains_nulls = len(any_null_row.collect()) > 0
Column-wise traversal (empirically this should be faster, see comment by Clock Slave):
import pyspark.sql.functions as f

contains_nulls = False
for c in df.columns:
    # limit(1) stops scanning the column at the first null found
    if len(df.where(f.col(c).isNull()).limit(1).collect()) > 0:
        contains_nulls = True
        break
limit(1) is used so that Spark can stop as soon as it finds the first null value, and collecting the (at most one-row) result and checking its length tells you whether anything matched.
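If you want Spark to raise an error as soon as a null is found, as asked in the question, you could wrap the column-wise check in a small helper. This is only a sketch; assert_no_nulls is not a Spark API, just an illustrative name:
import pyspark.sql.functions as f

def assert_no_nulls(df):
    """Illustrative helper: raise as soon as a null is found in any column."""
    for c in df.columns:
        # limit(1) lets Spark stop scanning at the first null in this column
        if len(df.where(f.col(c).isNull()).limit(1).collect()) > 0:
            raise ValueError("Column '%s' contains at least one null value" % c)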
Upvotes: 2
Reputation: 173
As I understand it, your requirement is just to raise a flag if any of the columns has a null. You don't need to know which rows actually contain nulls.
Solution: the easiest approach I can think of is creating a tempView of your DataFrame and checking for nulls across all the columns. Here is the pseudocode for that:
YourDF.createOrReplaceTempView("tempView")
# count the rows that have a null in any of the listed columns
null_count = sqlContext.sql(
    "SELECT count(*) AS cnt FROM tempView WHERE Col1 IS NULL OR Col2 IS NULL OR Col3 IS NULL"
).collect()[0]["cnt"]
flag = False
if null_count > 0:
    flag = True
Now use flag as you want.
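Since a full count(*) is exactly what the question wants to avoid, here is a minimal variant of the same SQL check (same assumed tempView and column names) that lets Spark stop at the first matching row:
# LIMIT 1 means Spark can stop as soon as one offending row is found
first_null_row = sqlContext.sql(
    "SELECT * FROM tempView WHERE Col1 IS NULL OR Col2 IS NULL OR Col3 IS NULL LIMIT 1"
).collect()
flag = len(first_null_row) > 0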
Regards, Anupam
Upvotes: 0
Reputation: 4540
You can use limit for that:
from pyspark.sql.functions import col

# "c" is the name of the column to check for nulls
df.select("*").where(col("c").isNull()).limit(1)
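Note that the line above is only a transformation; nothing runs until an action is called. A minimal sketch of how you might use it, assuming "c" is the column to check:
from pyspark.sql.functions import col

# take(1) is an action; an empty result means column "c" has no nulls
has_null_in_c = len(df.where(col("c").isNull()).limit(1).take(1)) > 0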
Upvotes: 3