Reputation: 168
I have a scenario, where I have to find columns with Null values per each record, and write all such column names into a separate column.
Example: I have this DataFrame:
+---------+---+------------+-----------+------+-------+
|firstName|age|jobStartDate|isGraduated|gender| salary|
+---------+---+------------+-----------+------+-------+
| null|se3| 2006-01-01| 8| M| F|
| null| a3| null| True| F| null|
| Robert| 37| 1992-01-01| null| M|5000.50|
+---------+---+------------+-----------+------+-------+
Expected result should be like the one below:
+---------+---+------------+-----------+------+-------+----------------------+
|firstName|age|jobStartDate|isGraduated|gender| salary| Missing Columns|
+---------+---+------------+-----------+------+-------+----------------------+
| null|se3| 2006-01-01| 8| M| F| firstName|
| null| a3| 2006-01-02| True| F| null| firstName,salary|
| Robert| 37| 1992-01-01| null| M|5000.50| isGraduated|
+---------+---+------------+-----------+------+-------+----------------------+
I have written code which half meets my expected results:
def find_exceptions(df,mand_cols = ['firstName','jobStartDate','salary']):
miss = "Missing: "
for column in mand_cols:
if df[column] is None:
miss = miss + column + ","
return miss
I am able to collect the missing values as list:
temp = sourceDF.rdd.map(find_exceptions)
temp.collect()
#result:
['Missing: firstName,', 'Missing: firstName,jobStartDate,salary,', 'Missing: ']
I am finding it difficult to actually write this into a new column. I am fairly new to Spark and would really appreciate if someone could help me with this.
Upvotes: 1
Views: 949
Reputation: 494
You can do this in three steps.
Step 1: Create an array of size number of columns. If an entry is null, then set the respective element in array as the name of column name, else leave the value null.
Step 2: Filter the array for column names
Step 3: Concatenate to have comma-separated list
df //step 1
.withColumn("MissingColumns",
array(
when(col("firstName").isNull(),lit("firstName")),
when(col("age").isNull(),lit("age")),
when(col("jobStartDate").isNull(),lit("jobStartDate")),
when(col("isGraduated").isNull(),lit("isGraduated")),
when(col("gender").isNull(),lit("gender")),
when(col("salary").isNull(),lit("salary"))
)
)
//step 2
.withColumn("MissingColumns",expr("filter(MissingColumns, c -> c IS NOT NULL)"))
//step 3
.withColumn("MissingColumns",concat_ws(",",col("MissingColumns")) )
Upvotes: 2