Reputation: 317
I need to cast the columns of a DataFrame, whose values are currently all strings, to the data types defined in a schema. While doing the cast, the corrupt records (values that do not match the target data type) need to be put into a separate column.
Example DataFrame:
+---+----------+-----+
|id |name      |class|
+---+----------+-----+
|1  |abc       |21   |
|2  |bca       |32   |
|3  |abab      |4    |
|4  |baba      |5a   |
|5  |cccca     |     |
+---+----------+-----+
JSON schema of the file:
{
  "definitions": {},
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "http://example.com/root.json",
  "type": ["object", "null"],
  "required": ["id", "name", "class"],
  "properties": {
    "id":    {"$id": "#/properties/id",    "type": ["integer", "null"]},
    "name":  {"$id": "#/properties/name",  "type": ["string", "null"]},
    "class": {"$id": "#/properties/class", "type": ["integer", "null"]}
  }
}
In this example, row 4 is the corrupt record because the class column is of type integer and "5a" cannot be cast to it. Only that record should end up in the corrupt-records column, not the 5th row, whose class value is simply empty.
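For reference, a minimal sketch of how the example DataFrame might be built, assuming every column is read as a string and the blank class value in row 5 is null (the builder code below is illustrative, not from the original question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("corrupt-records").getOrCreate()
import spark.implicits._

// All columns are strings; "class" should become integer per the schema.
val df = Seq(
  ("1", "abc",   "21"),
  ("2", "bca",   "32"),
  ("3", "abab",  "4"),
  ("4", "baba",  "5a"),  // corrupt: "5a" cannot be cast to integer
  ("5", "cccca", null)   // assumed null/empty, so not corrupt
).toDF("id", "name", "class")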
Upvotes: 3
Views: 1163
Reputation: 36
Just check whether the value is NOT NULL before casting and NULL after casting:
import org.apache.spark.sql.functions.when
import spark.implicits._  // for the $"..." column syntax

df
  .withColumn("class_integer", $"class".cast("integer"))
  .withColumn(
    "class_corrupted",
    // original value was present but failed the cast => corrupt
    when($"class".isNotNull and $"class_integer".isNull, $"class"))
Repeat for each column / cast you need.
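If there are many columns, the same pattern can be applied to every field of a target schema with a small helper. This is only a sketch under the assumption that the target types are available as a Spark StructType; the names targetSchema and withCastAndCorrupt are made up for illustration:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, when}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical target schema derived from the JSON schema in the question.
val targetSchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("class", IntegerType)
))

// Apply the cast-and-flag pattern from the answer to every field of the schema.
def withCastAndCorrupt(df: DataFrame, schema: StructType): DataFrame =
  schema.fields.foldLeft(df) { (acc, field) =>
    val casted = col(field.name).cast(field.dataType)
    acc
      .withColumn(s"${field.name}_casted", casted)
      .withColumn(
        s"${field.name}_corrupted",
        // value was present but failed the cast => keep the raw value as corrupt
        when(col(field.name).isNotNull and casted.isNull, col(field.name)))
  }

// Usage: val result = withCastAndCorrupt(df, targetSchema)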
Upvotes: 2