Reputation: 452
I want to count the number of nulls in a DataFrame by row.
Please note there are 50+ columns; I know I could do a case/when statement to do this, but I would prefer a neater solution.
For example, a subset:
columns = ['id', 'item1', 'item2', 'item3']
vals = [(1, 2, 0, None), (2, None, 1, None), (3, None, 9, 1)]
df = spark.createDataFrame(vals, columns)
df.show()
+---+-----+-----+-----+
| id|item1|item2|item3|
+---+-----+-----+-----+
|  1|    2|    0| null|
|  2| null|    1| null|
|  3| null|    9|    1|
+---+-----+-----+-----+
After running the code, the desired output is:
+---+-----+-----+-----+--------+
| id|item1|item2|item3|numNulls|
+---+-----+-----+-----+--------+
|  1|    2|    0| null|       1|
|  2| null|    1| null|       2|
|  3| null|    9|    1|       1|
+---+-----+-----+-----+--------+
EDIT: In my actual data, not all non-null values are ints.
Upvotes: 11
Views: 9548
Reputation: 1
# Create a new DataFrame with only the 'id' column and a 'numNulls' column
# (the count of null values in each row). To do this, convert the original
# DataFrame to an RDD, count the Nones per Row, and convert back to a DataFrame.
df2 = df.rdd.map(lambda x: (x[0], x.count(None))).toDF(['id','numNulls'])
df2.show()
+---+--------+
| id|numNulls|
+---+--------+
|  1|       1|
|  2|       2|
|  3|       1|
+---+--------+
# Now join the original DataFrame with the new one on the 'id' column
df3 = df.join(df2, df.id == df2.id, 'inner').drop(df2.id)
df3.show()
+---+-----+-----+-----+--------+
| id|item1|item2|item3|numNulls|
+---+-----+-----+-----+--------+
|  1|    2|    0| null|       1|
|  2| null|    1| null|       2|
|  3| null|    9|    1|       1|
+---+-----+-----+-----+--------+
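If the extra join is a concern, a variation on the same RDD approach (just a sketch against the toy schema above, not part of the original answer) appends the count in the same pass:
# Append a per-row null count to each Row and rebuild the DataFrame directly,
# avoiding the separate join at the cost of going through the RDD API.
df3 = (
    df.rdd
      .map(lambda row: tuple(row) + (sum(v is None for v in row),))
      .toDF(df.columns + ['numNulls'])
)
df3.show()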
Upvotes: 0
Reputation: 214957
Convert null to 1 and others to 0, then sum all the columns:
df.withColumn('numNulls', sum(df[col].isNull().cast('int') for col in df.columns)).show()
+---+-----+-----+-----+--------+
| id|item1|item2|item3|numNulls|
+---+-----+-----+-----+--------+
|  1|    2|    0| null|       1|
|  2| null|    1| null|       2|
|  3| null|    9|    1|       1|
+---+-----+-----+-----+--------+
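Note that `sum` here is the Python built-in reducing over Column objects, not `pyspark.sql.functions.sum`; a wildcard import from `pyspark.sql.functions` would shadow it and break the expression. An equivalent sketch using an explicit reduce avoids that ambiguity:
from functools import reduce
from operator import add
from pyspark.sql import functions as F

# Cast each column's isNull() flag to int and add the flags up explicitly,
# so nothing depends on which `sum` is in scope.
null_count = reduce(add, [F.col(c).isNull().cast('int') for c in df.columns])
df.withColumn('numNulls', null_count).show()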
Upvotes: 21