Gaurang Shah

Reputation: 12930

Spark - Permissive mode with JSON file moves all records to corrupt column

I am trying to ingest a JSON file using Spark, applying a schema manually to create the DataFrame. The problem is that even when only a single record fails to match the schema, Spark moves the whole file (all the records) to the corrupt column. Why?

Data

[{
  "RecordNumber": 2,
  "Zipcode": 704,
  "ZipCodeType": "STANDARD",
  "City": "PASEO COSTA DEL SUR",
  "State": "PR"
},
{
  "RecordNumber": 10,
  "Zipcode": 709,
  "ZipCodeType": "STANDARD",
  "City": "BDA SAN LUIS",
  "State": "PR"
},{
  "Zipcode": "709aa",
  "ZipCodeType": "STANDARD",
  "City": "BDA SAN LUIS",
  "State": "PR"
}]

Code

import org.apache.spark.sql.types._
import org.apache.spark.sql.types.DataTypes._

val s = StructType(StructField("City", StringType, true) ::
  StructField("RecordNumber", LongType, true) ::
  StructField("State", StringType, true) ::
  StructField("ZipCodeType", StringType, true) ::
  StructField("Zipcode", LongType, true) ::
  StructField("corrupted_record", StringType, true) ::
  Nil)

val df2 = spark.read.
  option("multiline", "true").
  option("mode", "PERMISSIVE").
  option("columnNameOfCorruptRecord", "corrupted_record").
  schema(s).
  json("/tmp/test.json")

df2.show(false)

Output

scala> df2.filter($"corrupted_record".isNotNull).show(false)
+----+------------+-----+-----------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|City|RecordNumber|State|ZipCodeType|Zipcode|corrupted_record                                                                                                                                                                                                                                                                                                                                |
+----+------------+-----+-----------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|null|null        |null |null       |null   |[{
  "RecordNumber": 2,
  "Zipcode": 704,
  "ZipCodeType": "STANDARD",
  "City": "PASEO COSTA DEL SUR",
  "State": "PR"
},
{
  "RecordNumber": 10,
  "Zipcode": 709,
  "ZipCodeType": "STANDARD",
  "City": "BDA SAN LUIS",
  "State": "PR"
},{
  "Zipcode": "709aa",
  "ZipCodeType": "STANDARD",
  "City": "BDA SAN LUIS",
  "State": "PR"
}]|
+----+------------+-----+-----------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Question: Since only the third record has Zipcode as a string where I expect an integer ("Zipcode": "709aa"), shouldn't only the third record go to the corrupted_record column, with the others parsed correctly?

Upvotes: 0

Views: 2234

Answers (1)

BulentB

Reputation: 316

You have only one record (that's because of multiline = true), and it is corrupt, so everything goes there. With multiline enabled, Spark parses the whole file as a single JSON value, so one bad field corrupts the entire record.
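A quick way to see this against the code from the question (a sketch; the counts are what I'd expect from your output, not something verified here):

// With multiline=true the whole file is one parse unit, so the single
// type mismatch ("709aa" vs LongType) corrupts the entire record.
df2.count()                                        // expected: 1
df2.filter($"corrupted_record".isNotNull).count()  // expected: 1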

As the documentation says, if you want Spark to treat the records separately you need to use the JSON Lines format. That will also scale better for bigger files, because Spark can distribute the parsing across multiple executors.
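For example, here is a minimal sketch assuming the same records are rewritten one complete JSON object per line into a hypothetical file /tmp/test.jsonl:

// /tmp/test.jsonl -- JSON Lines: one complete JSON object per line
// {"RecordNumber": 2, "Zipcode": 704, "ZipCodeType": "STANDARD", "City": "PASEO COSTA DEL SUR", "State": "PR"}
// {"RecordNumber": 10, "Zipcode": 709, "ZipCodeType": "STANDARD", "City": "BDA SAN LUIS", "State": "PR"}
// {"Zipcode": "709aa", "ZipCodeType": "STANDARD", "City": "BDA SAN LUIS", "State": "PR"}

// Same schema `s` as in the question; just drop the multiline option so
// each line is parsed as its own record.
val df3 = spark.read.
  option("mode", "PERMISSIVE").
  option("columnNameOfCorruptRecord", "corrupted_record").
  schema(s).
  json("/tmp/test.jsonl")

// Now only the line with "Zipcode": "709aa" should land in
// corrupted_record; the first two rows parse into the typed columns.
df3.filter($"corrupted_record".isNotNull).show(false)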

Upvotes: 0
