Reputation: 111
I have a json which can't be read by spark(spark.read.json("xxx").show()
)
{'event_date_utc': None,'deleted': False, 'cost':1 , 'name':'Mike'}
The problem seems to be the None and False are not under single quote, and spark can't default them to boolean, null or even string.
I tried to give my spark read a schema instead of inferred by forcing those 2 column to be string and have the same error.
Feel like to me spark is trying to read the data first then apply schema then failed in the read part.
Is there a way to tell spark to read those values without modify the input data? I am using python.
Upvotes: 2
Views: 3641
Reputation: 32700
You input isn't a valid JSON so you can't read it using spark.read.json
. Instead, you can load it as text DataFrame with spark.read.text
and parse the stringified dict into json using UDF:
import ast
import json
from pyspark.sql import functions as F
from pyspark.sql.types import *
schema = StructType([
StructField("event_date_utc", StringType(), True),
StructField("deleted", BooleanType(), True),
StructField("cost", IntegerType(), True),
StructField("name", StringType(), True)
])
dict_to_json = F.udf(lambda x: json.dumps(ast.literal_eval(x)))
df = spark.read.text("xxx") \
.withColumn("value", F.from_json(dict_to_json("value"), schema)) \
.select("value.*")
df.show()
#+--------------+-------+----+----+
#|event_date_utc|deleted|cost|name|
#+--------------+-------+----+----+
#|null |false |1 |Mike|
#+--------------+-------+----+----+
Upvotes: 2
Reputation: 42422
The JSON doesn't look good. Field values needs to be quoted.
You can eval
the lines first, which look like they're in Python dict format.
df = spark.createDataFrame(
sc.textFile('true.json').map(eval),
'event_date_utc boolean, deleted boolean, cost int, name string'
)
df.show()
+--------------+-------+----+----+
|event_date_utc|deleted|cost|name|
+--------------+-------+----+----+
| null| false| 1|Mike|
+--------------+-------+----+----+
Upvotes: 1