pyspark corrupt_record while reading json file

Question

I have a JSON file that Spark can't read with spark.read.json("xxx").show(). Each line looks like this:

{'event_date_utc': None,'deleted': False, 'cost':1 , 'name':'Mike'}

The problem seems to be that None and False are not quoted, and Spark can't interpret them as a boolean, a null, or even a string.
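A quick check outside Spark illustrates the point. This is only a sketch using Python's standard library, not Spark's parser: the stdlib json module is stricter than Spark (it also rejects the single-quoted keys, which Spark's JSON reader tolerates by default), but it shows that the line is a Python dict literal rather than JSON, and that ast.literal_eval can read it.

```python
import ast
import json

# The sample line from the question: a Python dict literal, not JSON.
line = "{'event_date_utc': None,'deleted': False, 'cost':1 , 'name':'Mike'}"

# The strict stdlib JSON parser rejects the line.
try:
    json.loads(line)
    is_valid_json = True
except json.JSONDecodeError:
    is_valid_json = False

# ast.literal_eval safely evaluates Python literals (no arbitrary code execution).
record = ast.literal_eval(line)

# Re-serializing the dict produces real JSON: None -> null, False -> false.
as_json = json.dumps(record)

print(is_valid_json)  # False
print(as_json)        # {"event_date_utc": null, "deleted": false, "cost": 1, "name": "Mike"}
```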

I tried giving the reader an explicit schema instead of letting Spark infer one, forcing those two columns to be strings, and got the same error.

It looks to me as if Spark parses the data first and only then applies the schema, so it fails in the parsing step.

Is there a way to tell Spark to read those values without modifying the input data? I am using Python.

blackbishop · Accepted Answer

Your input isn't valid JSON, so you can't read it with spark.read.json. Instead, you can load it as a text DataFrame with spark.read.text and convert each stringified dict to JSON using a UDF:

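import ast
import json
from pyspark.sql import functions as F
from pyspark.sql.types import (
    BooleanType, IntegerType, StringType, StructField, StructType
)

# Target schema for the parsed records.
schema = StructType([
    StructField("event_date_utc", StringType(), True),
    StructField("deleted", BooleanType(), True),
    StructField("cost", IntegerType(), True),
    StructField("name", StringType(), True)
])

# Evaluate each line as a Python literal, then re-serialize it as valid JSON.
# F.udf returns a string column by default.
dict_to_json = F.udf(lambda x: json.dumps(ast.literal_eval(x)))

# Read every line as plain text, convert it to JSON, then parse it with the schema.
df = spark.read.text("xxx") \
    .withColumn("value", F.from_json(dict_to_json("value"), schema)) \
    .select("value.*")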

df.show()

#+--------------+-------+----+----+
#|event_date_utc|deleted|cost|name|
#+--------------+-------+----+----+
#|null          |false  |1   |Mike|
#+--------------+-------+----+----+
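One caveat with the lambda above is that a single malformed line makes ast.literal_eval raise inside the UDF, which fails the whole job. A more defensive variant could look like the sketch below. The name dict_to_json_safe is hypothetical, and returning None for bad lines is an assumption about the behavior you want, not part of the original answer.

```python
import ast
import json
from typing import Optional


def dict_to_json_safe(raw: Optional[str]) -> Optional[str]:
    """Convert a stringified Python dict to a JSON string, or None if it can't be parsed."""
    if raw is None:
        return None
    try:
        # literal_eval only accepts Python literals, so it does not run arbitrary code.
        parsed = ast.literal_eval(raw.strip())
    except (ValueError, SyntaxError, TypeError):
        return None
    # Only dicts map to the struct schema; reject lists, numbers, and so on.
    if not isinstance(parsed, dict):
        return None
    try:
        return json.dumps(parsed)
    except (TypeError, ValueError):
        # Values such as sets or bytes are valid literals but not JSON-serializable.
        return None


print(dict_to_json_safe("{'deleted': False, 'name': 'Mike'}"))
print(dict_to_json_safe("this is not a dict {"))
```

It could then be wrapped with F.udf(dict_to_json_safe, StringType()) in place of the lambda. Since from_json yields null for a null input, unparseable lines would show up as rows of nulls rather than stopping the job.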
