Reputation: 12930
I am trying to ingest a JSON file using Spark, applying a schema manually to create a DataFrame. The problem is that even when only a single record fails to match the schema, the whole file (all the records) is moved to the corrupt column.
Data
[{
"RecordNumber": 2,
"Zipcode": 704,
"ZipCodeType": "STANDARD",
"City": "PASEO COSTA DEL SUR",
"State": "PR"
},
{
"RecordNumber": 10,
"Zipcode": 709,
"ZipCodeType": "STANDARD",
"City": "BDA SAN LUIS",
"State": "PR"
},{
"Zipcode": "709aa",
"ZipCodeType": "STANDARD",
"City": "BDA SAN LUIS",
"State": "PR"
}]
Code
import org.apache.spark.sql.types._

val s = StructType(StructField("City", StringType, true) ::
  StructField("RecordNumber", LongType, true) ::
  StructField("State", StringType, true) ::
  StructField("ZipCodeType", StringType, true) ::
  StructField("Zipcode", LongType, true) ::
  StructField("corrupted_record", StringType, true) ::
  Nil)

val df2 = spark.read.
  option("multiline", "true").
  option("mode", "PERMISSIVE").
  option("columnNameOfCorruptRecord", "corrupted_record").
  schema(s).
  json("/tmp/test.json")

df2.show(false)
df2.show(false)
Output
scala> df2.filter($"corrupted_record".isNotNull).show(false)
+----+------------+-----+-----------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|City|RecordNumber|State|ZipCodeType|Zipcode|corrupted_record |
+----+------------+-----+-----------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|null|null |null |null |null |[{
"RecordNumber": 2,
"Zipcode": 704,
"ZipCodeType": "STANDARD",
"City": "PASEO COSTA DEL SUR",
"State": "PR"
},
{
"RecordNumber": 10,
"Zipcode": 709,
"ZipCodeType": "STANDARD",
"City": "BDA SAN LUIS",
"State": "PR"
},{
"Zipcode": "709aa",
"ZipCodeType": "STANDARD",
"City": "BDA SAN LUIS",
"State": "PR"
}]
Question
As only the third record has Zipcode as a string ("Zipcode": "709aa") while I expect it to be an integer, shouldn't only that third record go to the corrupted_record column, with the others parsed correctly?
Upvotes: 0
Views: 2234
Reputation: 316
You have only one record (that's because of multiline, true), and it is corrupt, so everything goes into the corrupt column.
As the documentation says, if you want Spark to treat records separately you need to use the JSON Lines format, which will also scale better for bigger files because Spark can distribute the parsing across multiple executors.
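A minimal sketch of that change, using your own schema and options: the same three records are rewritten as JSON Lines (one object per line, no enclosing array) at a hypothetical path /tmp/test.jsonl, and the multiline option is dropped. The SparkSession setup is added only to make the snippet self-contained outside spark-shell.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import java.nio.file.{Files, Paths}

val spark = SparkSession.builder().master("local[1]").appName("jsonl").getOrCreate()
import spark.implicits._

// Same three records rewritten as JSON Lines: one JSON object per line, no array.
val jsonl =
  """{"RecordNumber": 2, "Zipcode": 704, "ZipCodeType": "STANDARD", "City": "PASEO COSTA DEL SUR", "State": "PR"}
    |{"RecordNumber": 10, "Zipcode": 709, "ZipCodeType": "STANDARD", "City": "BDA SAN LUIS", "State": "PR"}
    |{"Zipcode": "709aa", "ZipCodeType": "STANDARD", "City": "BDA SAN LUIS", "State": "PR"}""".stripMargin
Files.write(Paths.get("/tmp/test.jsonl"), jsonl.getBytes)

val s = StructType(StructField("City", StringType, true) ::
  StructField("RecordNumber", LongType, true) ::
  StructField("State", StringType, true) ::
  StructField("ZipCodeType", StringType, true) ::
  StructField("Zipcode", LongType, true) ::
  StructField("corrupted_record", StringType, true) ::
  Nil)

// No multiline option: each line is now parsed as an independent record,
// so a bad line corrupts only itself, not the whole file.
val df = spark.read.
  option("mode", "PERMISSIVE").
  option("columnNameOfCorruptRecord", "corrupted_record").
  schema(s).
  json("/tmp/test.jsonl")

df.filter($"corrupted_record".isNotNull).show(false)
// Only the "709aa" row should end up in corrupted_record; the other two parse normally.
```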
Upvotes: 0