milton

Reputation: 111

pyspark corrupt_record while reading json file

I have a JSON file that Spark can't read with spark.read.json("xxx").show():

{'event_date_utc': None,'deleted': False, 'cost':1 , 'name':'Mike'}

The problem seems to be that None and False are not quoted, so Spark can't interpret them as boolean, null, or even string values.
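A quick check outside Spark, using only Python's standard library, illustrates the issue: the line is rejected by a strict JSON parser but parses as a Python dict literal. This is only a sketch; the variable names are made up for illustration.

```python
import ast
import json

line = "{'event_date_utc': None,'deleted': False, 'cost':1 , 'name':'Mike'}"

# A strict JSON parser rejects the line (single quotes, None, False).
try:
    json.loads(line)
    is_valid_json = True
except json.JSONDecodeError:
    is_valid_json = False

# The same text is accepted as a Python dict literal.
record = ast.literal_eval(line)

print(is_valid_json)   # False
print(record["name"])  # Mike
```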

I tried giving the reader an explicit schema instead of letting Spark infer one, forcing those two columns to be strings, but I get the same error.

It feels to me like Spark tries to read the data first and only then applies the schema, so it fails during the read.

Is there a way to tell Spark to read those values without modifying the input data? I am using Python.

Upvotes: 2

Views: 3641

Answers (2)

blackbishop

Reputation: 32700

Your input isn't valid JSON, so you can't read it using spark.read.json. Instead, you can load it as a text DataFrame with spark.read.text and parse the stringified dict into JSON using a UDF:

import ast
import json
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, BooleanType, IntegerType

schema = StructType([
    StructField("event_date_utc", StringType(), True),
    StructField("deleted", BooleanType(), True),
    StructField("cost", IntegerType(), True),
    StructField("name", StringType(), True)
])

# Parse each line as a Python literal, then re-serialize it as valid JSON
dict_to_json = F.udf(lambda x: json.dumps(ast.literal_eval(x)))

df = spark.read.text("xxx") \
    .withColumn("value", F.from_json(dict_to_json("value"), schema)) \
    .select("value.*")

df.show()

#+--------------+-------+----+----+
#|event_date_utc|deleted|cost|name|
#+--------------+-------+----+----+
#|null          |false  |1   |Mike|
#+--------------+-------+----+----+
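To make the per-row conversion easier to follow, here is a plain-Python sketch of what the UDF does to each line, with one assumed addition: lines that can't be parsed return None instead of raising. The function name dict_line_to_json is hypothetical and not part of the code above.

```python
import ast
import json
from typing import Optional


def dict_line_to_json(line: str) -> Optional[str]:
    """Convert one Python-dict-style line to a JSON string, or None if it can't be parsed."""
    try:
        # literal_eval only accepts Python literals (dicts, strings, numbers, None, booleans, ...)
        return json.dumps(ast.literal_eval(line))
    except (ValueError, SyntaxError, TypeError):
        # Assumption: malformed lines should become nulls rather than fail the job
        return None


converted = dict_line_to_json(
    "{'event_date_utc': None,'deleted': False, 'cost':1 , 'name':'Mike'}"
)
print(converted)
print(dict_line_to_json("not a dict"))
```

If you want that behavior in Spark, one option is to wrap this function with F.udf in place of the lambda; from_json returns null for a null input, so unparseable lines would come through as rows of nulls.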

Upvotes: 2

mck

Reputation: 42422

The JSON isn't valid. Strings need to be in double quotes, and None and False are Python literals, not JSON values.

You can evaluate the lines first, since they appear to be in Python dict format.

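import ast

# ast.literal_eval only parses Python literals, so it is safer than eval on file contents
df = spark.createDataFrame(
    spark.sparkContext.textFile('true.json').map(ast.literal_eval),
    'event_date_utc string, deleted boolean, cost int, name string'
)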

df.show()
+--------------+-------+----+----+
|event_date_utc|deleted|cost|name|
+--------------+-------+----+----+
|          null|  false|   1|Mike|
+--------------+-------+----+----+
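Because this approach evaluates every line of the file as Python, it may be worth knowing how the built-in eval differs from ast.literal_eval: the latter accepts only literals and refuses arbitrary expressions. A small sketch, with made-up sample strings:

```python
import ast

good_line = "{'event_date_utc': None,'deleted': False, 'cost':1 , 'name':'Mike'}"
bad_line = "__import__('os').getcwd()"  # an expression, not a literal

# A dict literal is parsed into a regular Python dict.
parsed = ast.literal_eval(good_line)

# A function call is rejected instead of being executed, unlike with eval.
try:
    ast.literal_eval(bad_line)
    rejected = False
except ValueError:
    rejected = True

print(parsed["deleted"])  # False
print(rejected)           # True
```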

Upvotes: 1
