Reputation: 452
I want to count the number of nulls in a DataFrame by row.
Please note there are 50+ columns; I know I could do a case/when statement to do this, but I would prefer a neater solution.
For example, a subset:
columns = ['id', 'item1', 'item2', 'item3']
vals = [(1, 2, 0, None), (2, None, 1, None), (3, None, 9, 1)]
df = spark.createDataFrame(vals, columns)
df.show()
+---+-----+-----+-----+
| id|item1|item2|item3|
+---+-----+-----+-----+
|  1|    2|    0| null|
|  2| null|    1| null|
|  3| null|    9|    1|
+---+-----+-----+-----+
After running the code, the desired output is:
+---+-----+-----+-----+--------+
| id|item1|item2|item3|numNulls|
+---+-----+-----+-----+--------+
|  1|    2|    0| null|       1|
|  2| null|    1| null|       2|
|  3| null|    9|    1|       1|
+---+-----+-----+-----+--------+
EDIT: In my actual data, not all non-null values are ints.
Upvotes: 11
Views: 9548
Reputation: 1
# Create a new DataFrame with only the 'id' column and a 'numNulls' column
# (the count of null values in each row). To do this, convert the original
# DataFrame to an RDD, count the Nones per Row, and convert back to a DataFrame.
df2 = df.rdd.map(lambda x: (x[0], x.count(None))).toDF(['id','numNulls'])
df2.show()
+---+--------+
| id|numNulls|
+---+--------+
|  1|       1|
|  2|       2|
|  3|       1|
+---+--------+
# Now join the original DataFrame with the new one on the 'id' column
df3 = df.join(df2, df.id == df2.id, 'inner').drop(df2.id)
df3.show()
+---+-----+-----+-----+--------+
| id|item1|item2|item3|numNulls|
+---+-----+-----+-----+--------+
|  1|    2|    0| null|       1|
|  2| null|    1| null|       2|
|  3| null|    9|    1|       1|
+---+-----+-----+-----+--------+
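If the extra join is a concern, a variation on the same RDD approach (just a sketch against the toy schema above, not part of the original answer) appends the count in the same pass:
# Append a per-row null count to each Row and rebuild the DataFrame directly,
# avoiding the separate join at the cost of going through the RDD API.
df3 = (
    df.rdd
      .map(lambda row: tuple(row) + (sum(v is None for v in row),))
      .toDF(df.columns + ['numNulls'])
)
df3.show()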
Upvotes: 0
Reputation: 214957
Convert null to 1 and others to 0, then sum all the columns:
df.withColumn('numNulls', sum(df[col].isNull().cast('int') for col in df.columns)).show()
+---+-----+-----+-----+--------+
| id|item1|item2|item3|numNulls|
+---+-----+-----+-----+--------+
|  1|    2|    0| null|       1|
|  2| null|    1| null|       2|
|  3| null|    9|    1|       1|
+---+-----+-----+-----+--------+
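Note that `sum` here is the Python built-in reducing over Column objects, not `pyspark.sql.functions.sum`; a wildcard import from `pyspark.sql.functions` would shadow it and break the expression. An equivalent sketch using an explicit reduce avoids that ambiguity:
from functools import reduce
from operator import add
from pyspark.sql import functions as F

# Cast each column's isNull() flag to int and add the flags up explicitly,
# so nothing depends on which `sum` is in scope.
null_count = reduce(add, [F.col(c).isNull().cast('int') for c in df.columns])
df.withColumn('numNulls', null_count).show()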
Upvotes: 21