Sanjeev S

Reputation: 11

How to collect bad rows/records in pyspark while processing?

Hello, I'm new to PySpark. While converting the string column "DOB" in my test.csv file to Date format, I ran into an issue where PySpark converts the bad records to null values. I'm aware of PySpark's mechanisms for handling bad data while reading the file/input, like PERMISSIVE mode, FAILFAST mode, and badRecordsPath, but is there any possible way to collect the bad records while running a function?
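
For reference, the read-time approach looks roughly like this (a sketch only; the df_read name and the two-column schema are assumed from the post, and _corrupt_record only captures rows that are malformed at parse time, not well-formed strings holding an invalid date):

df_read = (spark.read
    .option("header", True)
    .option("mode", "PERMISSIVE")                             # keep malformed rows instead of failing
    .option("columnNameOfCorruptRecord", "_corrupt_record")   # raw text of malformed rows lands here
    .schema("NAME string, DOB string, _corrupt_record string")
    .csv("./DataSource/test.csv"))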

The Code is:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("Dob_test").getOrCreate()
df = spark.read.option("header", True).csv("./DataSource/test.csv")
df.show()
df.select(to_date(col("DOB"), "MM-dd-yyyy").alias("Date Of Birth")).show()

Input: (screenshot "DOB input" showing the NAME and DOB columns of test.csv)

The Result:

(screenshot "Result_DOB" showing the converted Date Of Birth column)

In the Result_DOB screenshot, the Date Of Birth values for NAME3 and NAME4 are null. So I want to collect the NAME3 and NAME4 records as error/corrupt records while running this line:

df.select(to_date(col("DOB"), "MM-dd-yyyy").alias("Date Of Birth")).show()

Any input would be valuable. Thank you.

Upvotes: 1

Views: 1537

Answers (1)

Anjaneya Tripathi

Reputation: 1459

You can add another column that keeps the original value whenever it cannot be cast, like below:

from pyspark.sql.functions import col, to_date, when

# keep the raw DOB in "corruptDOB" whenever to_date() returns null
tdf = spark.createDataFrame([("Name1","05-25-1994"),("Name2","11-10-1998"),("Name3","et-05-1994"),("Name4","Null")], "name string, DOB string")
tdf.withColumn("corruptDOB", when(to_date(col("DOB"), "MM-dd-yyyy").isNull(), col("DOB"))) \
   .select(to_date(col("DOB"), "MM-dd-yyyy").alias("Date Of Birth"), "corruptDOB").show()

This will give you output like below:

+-------------+----------+
|Date Of Birth|corruptDOB|
+-------------+----------+
|   1994-05-25|      null|
|   1998-11-10|      null|
|         null|et-05-1994|
|         null|      Null|
+-------------+----------+
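
If the goal is to then collect just those bad rows, a short follow-up sketch on the same tdf (bad_records is only an illustrative name):

bad_records = tdf.filter(to_date(col("DOB"), "MM-dd-yyyy").isNull())  # rows whose DOB did not parse
bad_records.show()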

Upvotes: 2
