Etisha

Reputation: 317

Casting DataFrame columns with validation in Spark

I need to cast the columns of a DataFrame, whose values are all strings, to the data types defined in a schema. While doing the cast, the corrupt records (values that do not match the target data type) need to be put into a separate column.

Example of Dataframe

+---+----------+-----+
|id |name      |class|
+---+----------+-----+
|1  |abc       |21   |
|2  |bca       |32   |
|3  |abab      | 4   |
|4  |baba      |5a   |
|5  |cccca     |     |
+---+----------+-----+

JSON schema of the file:

 {"definitions":{},"$schema":"http://json-schema.org/draft-07/schema#","$id":"http://example.com/root.json","type":["object","null"],"required":["id","name","class"],"properties":{"id":{"$id":"#/properties/id","type":["integer","null"]},"name":{"$id":"#/properties/name","type":["string","null"]},"class":{"$id":"#/properties/class","type":["integer","null"]}}}

Here row 4 is a corrupt record, because the class column should be of type Integer. Only this record should go into the corrupt-records column, not the 5th row (an empty value is fine, since the schema allows class to be null).

Upvotes: 3

Views: 1163

Answers (1)

user11692913

Reputation: 36

Just check whether the value is NOT NULL before the cast and NULL after the cast:

import org.apache.spark.sql.functions.when
import spark.implicits._ // needed for the $"colName" syntax

df
  .withColumn("class_integer", $"class".cast("integer"))
  .withColumn(
    "class_corrupted",
    // original value present, but cast produced null => corrupt
    when($"class".isNotNull and $"class_integer".isNull, $"class"))

Repeat for each column / cast you need.
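For reference, a minimal sketch of applying the same check to several columns at once, driven by a map of target types taken from the JSON schema above (the targetTypes map and the validated value name are just illustrative, not part of any library API):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, when}

// hypothetical map of column name -> target type, based on the schema in the question
val targetTypes = Map("id" -> "integer", "class" -> "integer")

val validated: DataFrame = targetTypes.foldLeft(df) { case (acc, (name, dt)) =>
  val casted = col(name).cast(dt)
  acc
    .withColumn(s"${name}_casted", casted)
    // value was present but the cast produced null => keep the raw value as corrupt
    .withColumn(s"${name}_corrupted", when(col(name).isNotNull and casted.isNull, col(name)))
}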

Upvotes: 2
