Gaurang Shah

Reputation: 12930

Spark - Permissive mode with JSON file moves all records to corrupt column

I am trying to ingest a JSON file using Spark, applying a schema manually to create the DataFrame. The problem is that even when only a single record fails to match the schema, Spark moves the whole file (all the records) to the corrupt column. Why?

Data

[{
  "RecordNumber": 2,
  "Zipcode": 704,
  "ZipCodeType": "STANDARD",
  "City": "PASEO COSTA DEL SUR",
  "State": "PR"
},
{
  "RecordNumber": 10,
  "Zipcode": 709,
  "ZipCodeType": "STANDARD",
  "City": "BDA SAN LUIS",
  "State": "PR"
},{
  "Zipcode": "709aa",
  "ZipCodeType": "STANDARD",
  "City": "BDA SAN LUIS",
  "State": "PR"
}]

Code

import org.apache.spark.sql.types._
import org.apache.spark.sql.types.DataTypes._

val s = StructType(StructField("City", StringType, true) ::
  StructField("RecordNumber", LongType, true) ::
  StructField("State", StringType, true) ::
  StructField("ZipCodeType", StringType, true) ::
  StructField("Zipcode", LongType, true) ::
  StructField("corrupted_record", StringType, true) ::
  Nil)

val df2 = spark.read.
  option("multiline", "true").
  option("mode", "PERMISSIVE").
  option("columnNameOfCorruptRecord", "corrupted_record").
  schema(s).
  json("/tmp/test.json")

df2.show(false)

Output

scala> df2.filter($"corrupted_record".isNotNull).show(false)
+----+------------+-----+-----------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|City|RecordNumber|State|ZipCodeType|Zipcode|corrupted_record                                                                                                                                                                                                                                                                                                                                |
+----+------------+-----+-----------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|null|null        |null |null       |null   |[{
  "RecordNumber": 2,
  "Zipcode": 704,
  "ZipCodeType": "STANDARD",
  "City": "PASEO COSTA DEL SUR",
  "State": "PR"
},
{
  "RecordNumber": 10,
  "Zipcode": 709,
  "ZipCodeType": "STANDARD",
  "City": "BDA SAN LUIS",
  "State": "PR"
},{
  "Zipcode": "709aa",
  "ZipCodeType": "STANDARD",
  "City": "BDA SAN LUIS",
  "State": "PR"
}]|
+----+------------+-----+-----------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Question: Since only the third record has Zipcode as a string where I expect an integer ("Zipcode": "709aa"), shouldn't only the third record go to the corrupted_record column, with the others parsed correctly?

Upvotes: 0

Views: 2234

Answers (1)

BulentB

Reputation: 316

You have only one record (that's because of multiline = true), and it is corrupt, so everything goes there. With multiline enabled, Spark parses the whole file as a single JSON value, so one bad field corrupts the entire record.
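A quick way to see this against the code from the question (a sketch; the counts are what I'd expect from your output, not something verified here):

// With multiline=true the whole file is one parse unit, so the single
// type mismatch ("709aa" vs LongType) corrupts the entire record.
df2.count()                                        // expected: 1
df2.filter($"corrupted_record".isNotNull).count()  // expected: 1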

As the documentation says, if you want Spark to treat the records separately you need to use the JSON Lines format. That will also scale better for bigger files, because Spark can distribute the parsing across multiple executors.
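For example, here is a minimal sketch assuming the same records are rewritten one complete JSON object per line into a hypothetical file /tmp/test.jsonl:

// /tmp/test.jsonl -- JSON Lines: one complete JSON object per line
// {"RecordNumber": 2, "Zipcode": 704, "ZipCodeType": "STANDARD", "City": "PASEO COSTA DEL SUR", "State": "PR"}
// {"RecordNumber": 10, "Zipcode": 709, "ZipCodeType": "STANDARD", "City": "BDA SAN LUIS", "State": "PR"}
// {"Zipcode": "709aa", "ZipCodeType": "STANDARD", "City": "BDA SAN LUIS", "State": "PR"}

// Same schema `s` as in the question; just drop the multiline option so
// each line is parsed as its own record.
val df3 = spark.read.
  option("mode", "PERMISSIVE").
  option("columnNameOfCorruptRecord", "corrupted_record").
  schema(s).
  json("/tmp/test.jsonl")

// Now only the line with "Zipcode": "709aa" should land in
// corrupted_record; the first two rows parse into the typed columns.
df3.filter($"corrupted_record".isNotNull).show(false)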

Upvotes: 0
