Finding columns with Null values and write them in a new column per each record in Pyspark

Question

I have a scenario, where I have to find columns with Null values per each record, and write all such column names into a separate column.

Example: I have this DataFrame:

+---------+---+------------+-----------+------+-------+
|firstName|age|jobStartDate|isGraduated|gender| salary|
+---------+---+------------+-----------+------+-------+
|     null|se3|  2006-01-01|          8|     M|      F|
|     null| a3|        null|       True|     F|   null|
|   Robert| 37|  1992-01-01|       null|     M|5000.50|
+---------+---+------------+-----------+------+-------+

Expected result should be like the one below:

+---------+---+------------+-----------+------+-------+----------------------+
|firstName|age|jobStartDate|isGraduated|gender| salary|       Missing Columns|
+---------+---+------------+-----------+------+-------+----------------------+
|     null|se3|  2006-01-01|          8|     M|      F|             firstName|
|     null| a3|  2006-01-02|       True|     F|   null|      firstName,salary|
|   Robert| 37|  1992-01-01|       null|     M|5000.50|           isGraduated|
+---------+---+------------+-----------+------+-------+----------------------+

I have written code which half meets my expected results:

def find_exceptions(df,mand_cols = ['firstName','jobStartDate','salary']):
  miss = "Missing: "
  for column in mand_cols:
    if df[column] is None:
      miss = miss + column + ","
  return miss

I am able to collect the missing values as list:

temp = sourceDF.rdd.map(find_exceptions)
temp.collect()
#result: 
['Missing: firstName,', 'Missing: firstName,jobStartDate,salary,', 'Missing: ']

I am finding it difficult to actually write this into a new column. I am fairly new to Spark and would really appreciate if someone could help me with this.

Benoit Descamps · Accepted Answer

You can do this in three steps.

Step 1: Create an array of size number of columns. If an entry is null, then set the respective element in array as the name of column name, else leave the value null.

Step 2: Filter the array for column names

Step 3: Concatenate to have comma-separated list

df //step 1
    .withColumn("MissingColumns",
      array(
        when(col("firstName").isNull(),lit("firstName")),
        when(col("age").isNull(),lit("age")),
        when(col("jobStartDate").isNull(),lit("jobStartDate")),
        when(col("isGraduated").isNull(),lit("isGraduated")),
        when(col("gender").isNull(),lit("gender")),
        when(col("salary").isNull(),lit("salary"))
      )
    )
     //step 2
      .withColumn("MissingColumns",expr("filter(MissingColumns, c -> c IS NOT NULL)"))
     //step 3
      .withColumn("MissingColumns",concat_ws(",",col("MissingColumns")) )

Finding columns with Null values and write them in a new column per each record in Pyspark

Answers (1)

Related Questions