Reputation: 11
Hello, I'm new to PySpark. While converting a string column "DOB" in my test.csv file to Date format, I ran into an issue where PySpark converts the bad records to null values. I'm aware of PySpark's options for handling bad data while reading a file, like PERMISSIVE mode, FAILFAST mode, and badRecordsPath, which collect the bad records at read time, but is there any way to collect the bad records while running a function?
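For reference, the read-time handling I mean looks something like this (a rough sketch; with CSV, capturing corrupt records in PERMISSIVE mode needs an explicit schema that includes the corrupt-record column, and badRecordsPath is a Databricks-specific option):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("Dob_test").getOrCreate()

# PERMISSIVE (the default): keep malformed rows and capture the raw line
# in the column named by columnNameOfCorruptRecord
schema = StructType([
    StructField("NAME", StringType(), True),
    StructField("DOB", StringType(), True),
    StructField("_corrupt_record", StringType(), True),
])
df = (spark.read
      .option("header", True)
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)
      .csv("./DataSource/test.csv"))

# FAILFAST: throw an exception on the first malformed row
# df = spark.read.option("header", True).option("mode", "FAILFAST").csv("./DataSource/test.csv")

# badRecordsPath (Databricks only): write malformed rows out to files
# df = spark.read.option("badRecordsPath", "./badRecords").csv("./DataSource/test.csv")

But these all only catch rows that are malformed at read time; they don't help once the file has already been read.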
The code is:
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, col

spark = SparkSession.builder.appName("Dob_test").getOrCreate()
df = spark.read.option("header", True).csv("./DataSource/test.csv")
df.show()
df.select(to_date(col("DOB"), "MM-dd-yyyy").alias("Date Of Birth")).show()
Input: [screenshot of the test.csv DOB data]
The result: [screenshot of the output DataFrame]
In the resulting Date Of Birth column, the values for NAME3 and NAME4 are null. I want to collect the NAME3 and NAME4 records as error/corrupt records while running this line:
df.select(to_date(col("DOB"), "MM-dd-yyyy").alias("Date Of Birth")).show()
Any input would be valuable. Thank you.
Upvotes: 1
Views: 1537
Reputation: 1459
You can add another column which keeps the original value whenever it cannot be cast, like below:
from pyspark.sql.functions import when, to_date, col

tdf = spark.createDataFrame(
    [("Name1", "05-25-1994"), ("Name2", "11-10-1998"),
     ("Name3", "et-05-1994"), ("Name4", "Null")],
    "name string, DOB string")

# when() without otherwise() yields null, so corruptDOB holds the raw
# value only for rows where to_date could not parse the DOB
tdf.withColumn("corruptDOB",
               when(to_date(col("DOB"), "MM-dd-yyyy").isNull(), col("DOB"))) \
   .select(to_date(col("DOB"), "MM-dd-yyyy").alias("Date Of Birth"),
           "corruptDOB") \
   .show()
This will give you output like below:
+-------------+----------+
|Date Of Birth|corruptDOB|
+-------------+----------+
| 1994-05-25| null|
| 1998-11-10| null|
| null|et-05-1994|
| null| Null|
+-------------+----------+
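If you then want those bad rows collected as their own DataFrame, a small extension of the above (same names, nothing new beyond a filter):

# Keep the corruptDOB marker column, then filter on it to collect
# the bad records (Name3 and Name4 here)
result = tdf.withColumn(
    "corruptDOB",
    when(to_date(col("DOB"), "MM-dd-yyyy").isNull(), col("DOB")))
bad_records = result.filter(col("corruptDOB").isNotNull())
bad_records.show()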
Upvotes: 2