Sanjeev S

Reputation: 11

How to collect bad rows/records in pyspark while processing?

Hello, I'm new to PySpark. While converting the string column "DOB" in my test.csv file to Date format, I ran into an issue where PySpark converts the bad records to null values. I'm aware of PySpark's mechanisms for handling bad data while reading the file/input, like PERMISSIVE mode, FAILFAST mode, and badRecordsPath, but is there any possible way to collect the bad records while running a function?
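
For reference, the read-time approach looks roughly like this (a sketch only; the df_read name and the two-column schema are assumed from the post, and _corrupt_record only captures rows that are malformed at parse time, not well-formed strings holding an invalid date):

df_read = (spark.read
    .option("header", True)
    .option("mode", "PERMISSIVE")                             # keep malformed rows instead of failing
    .option("columnNameOfCorruptRecord", "_corrupt_record")   # raw text of malformed rows lands here
    .schema("NAME string, DOB string, _corrupt_record string")
    .csv("./DataSource/test.csv"))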

The Code is:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("Dob_test").getOrCreate()
df = spark.read.option("header", True).csv("./DataSource/test.csv")
df.show()
df.select(to_date(col("DOB"), "MM-dd-yyyy").alias("Date Of Birth")).show()

Input: (screenshot "DOB input" showing the NAME and DOB columns of test.csv)

The Result:

(screenshot "Result_DOB" showing the converted Date Of Birth column)

In the Result_DOB screenshot, the Date Of Birth values for NAME3 and NAME4 are null. So I want to collect the NAME3 and NAME4 records as error/corrupt records while running this line:

df.select(to_date(col("DOB"), "MM-dd-yyyy").alias("Date Of Birth")).show()

Any input would be valuable. Thank you.

Upvotes: 1

Views: 1537

Answers (1)

Anjaneya Tripathi

Reputation: 1459

You can add another column that keeps the original value whenever it cannot be cast, like below:

from pyspark.sql.functions import col, to_date, when

# keep the raw DOB in "corruptDOB" whenever to_date() returns null
tdf = spark.createDataFrame([("Name1","05-25-1994"),("Name2","11-10-1998"),("Name3","et-05-1994"),("Name4","Null")], "name string, DOB string")
tdf.withColumn("corruptDOB", when(to_date(col("DOB"), "MM-dd-yyyy").isNull(), col("DOB"))) \
   .select(to_date(col("DOB"), "MM-dd-yyyy").alias("Date Of Birth"), "corruptDOB").show()

This will give you output like below:

+-------------+----------+
|Date Of Birth|corruptDOB|
+-------------+----------+
|   1994-05-25|      null|
|   1998-11-10|      null|
|         null|et-05-1994|
|         null|      Null|
+-------------+----------+
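
If the goal is to then collect just those bad rows, a short follow-up sketch on the same tdf (bad_records is only an illustrative name):

bad_records = tdf.filter(to_date(col("DOB"), "MM-dd-yyyy").isNull())  # rows whose DOB did not parse
bad_records.show()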

Upvotes: 2
