Juan David

Reputation: 381

How to count missing data per row in a data frame

I have this sample DataFrame:

from pyspark.sql.types import *

schema = StructType([
    StructField("ClientId", IntegerType(), True),
    StructField("m_ant21", IntegerType(), True),
    StructField("m_ant22", IntegerType(), True),
    StructField("m_ant23", IntegerType(), True),
    StructField("m_ant24", IntegerType(), True)])

df = sqlContext.createDataFrame(
    data=[(0, None, None, None, None),
          (1, 23, 13, 17, 99),
          (2, 0, 0, 0, 1),
          (3, 0, None, 1, 0),
          (4, None, None, None, None)],
    schema=schema)

which produces this DataFrame:

 +--------+-------+-------+-------+-------+
 |ClientId|m_ant21|m_ant22|m_ant23|m_ant24|
 +--------+-------+-------+-------+-------+
 |       0|   null|   null|   null|   null|
 |       1|     23|     13|     17|     99|
 |       2|      0|      0|      0|      1|
 |       3|      0|   null|      1|      0|
 |       4|   null|   null|   null|   null|
 +--------+-------+-------+-------+-------+

I'd like to create a new column that counts how many null values each row has. For example, the row with ClientId 0 should get a count of 4, and the row with ClientId 3 a count of 1.

Note that df is a pyspark.sql.dataframe.DataFrame.

Upvotes: 1

Views: 834

Answers (1)

akuiper

Reputation: 215077

Here is one option:

from pyspark.sql import Row

# append the new column's field to the original schema
# (note: StructType.add mutates the schema in place)
schema.add(StructField("count_null", IntegerType(), True))

# convert the data frame to an RDD and append the null count to each row;
# Row is a tuple subclass, so row + Row(...) concatenates the fields
df.rdd.map(lambda row: row + Row(sum(x is None for x in row))).toDF(schema).show()

+--------+-------+-------+-------+-------+----------+
|ClientId|m_ant21|m_ant22|m_ant23|m_ant24|count_null|
+--------+-------+-------+-------+-------+----------+
|       0|   null|   null|   null|   null|         4|
|       1|     23|     13|     17|     99|         0|
|       2|      0|      0|      0|      1|         0|
|       3|      0|   null|      1|      0|         1|
|       4|   null|   null|   null|   null|         4|
+--------+-------+-------+-------+-------+----------+
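
One caveat: since StructType.add mutates schema in place, the original schema object is changed. If you'd rather leave it untouched, a non-mutating sketch (schema_with_count is just an illustrative name) would be:

from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, IntegerType

# build a fresh schema from the existing fields instead of mutating the original
schema_with_count = StructType(
    schema.fields + [StructField("count_null", IntegerType(), True)])

df.rdd.map(lambda row: row + Row(sum(x is None for x in row))) \
  .toDF(schema_with_count).show()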

If you don't want to deal with the schema, here is another option:

from pyspark.sql.functions import col, when

df.withColumn("count_null", sum([when(col(x).isNull(), 1).otherwise(0) for x in df.columns])).show()

+--------+-------+-------+-------+-------+----------+
|ClientId|m_ant21|m_ant22|m_ant23|m_ant24|count_null|
+--------+-------+-------+-------+-------+----------+
|       0|   null|   null|   null|   null|         4|
|       1|     23|     13|     17|     99|         0|
|       2|      0|      0|      0|      1|         0|
|       3|      0|   null|      1|      0|         1|
|       4|   null|   null|   null|   null|         4|
+--------+-------+-------+-------+-------+----------+
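
Note that both options count nulls across every column, including ClientId. If you only want the m_ant columns counted, one way (a sketch; value_cols is an illustrative name) is to restrict the comprehension:

from pyspark.sql.functions import col, when

# only count nulls in the m_ant columns, leaving ClientId out
value_cols = [c for c in df.columns if c != "ClientId"]
df.withColumn("count_null",
              sum([when(col(c).isNull(), 1).otherwise(0) for c in value_cols])).show()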

Upvotes: 2
