Reputation: 6242
I have a very complex and highly nested JSON structure, stored as a string in a Hive table.
Is there any way to define a partial schema for Spark that describes only the root elements, so that I can then load one of the child structures as a whole, as a string?
To clarify, here are the root elements of my JSON:
{
  "meta": {....},
  "entry": [{..}, {...}]
}
I do not want to declare a schema for the whole thing, only for the root elements meta and entry. Then I need to extract the entries as an array of strings, with every entry being a separate JSON document.
Something like the snippet below, which unfortunately does not work (tried in Spark 2.2):
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

schema = StructType(
    [
        StructField("meta", StringType(), True),
        StructField("entry", ArrayType(StringType(), True), True)
    ]
)

# json_payload holds the raw JSON document for each row
rdd = rdd_src.map(lambda row: str(row.json_payload))
bundle = spark.read.json(rdd, schema=schema, multiLine=True)
Basically, the end goal is to get an array of strings from entry, with every string being a separate JSON document. My code above does not throw any error messages, but the resulting DataFrame contains only rows with blank values.
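For reference, once the read works I would expect to pull the individual entries out with something like the sketch below (a minimal sketch against the bundle DataFrame from my code above; explode is from pyspark.sql.functions):

from pyspark.sql.functions import explode

# One row per element of "entry"; each row then holds a single JSON document
entries = bundle.select(explode("entry").alias("entry_json"))

# Collect into a plain Python list of JSON strings
entry_strings = [row.entry_json for row in entries.collect()]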
Upvotes: 0
Views: 112
There is nothing particularly wrong with your approach and it should work just fine:
>>> spark.version
'2.2.1'
>>>
>>> from pyspark.sql.types import *
>>> schema = StructType(
...     [
...         StructField("meta", StringType(), True),
...         StructField("entry", ArrayType(StringType(), True), True)
...     ]
... )
>>> json = """{"meta": {"foo": "bar"}, "entry": [{"foo": "bar"}, {"bar": "foo"}]}"""
>>>
>>> spark.read.schema(schema).json(sc.parallelize([json])).show()
+-------------+--------------------+
| meta| entry|
+-------------+--------------------+
|{"foo":"bar"}|[{"foo":"bar"}, {...|
+-------------+--------------------+
If you get empty values, it is most likely because the documents (including the content of meta or entry) are not valid JSON and don't parse properly.
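One way to confirm that, as a minimal sketch (assuming the rdd of JSON strings from your question; _corrupt_record is just a column name chosen here): add a corrupt-record column to the schema and point columnNameOfCorruptRecord at it, so Spark routes unparseable documents there instead of silently nulling the row.

from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# Same schema as before, plus a column that captures records Spark fails to parse
debug_schema = StructType([
    StructField("meta", StringType(), True),
    StructField("entry", ArrayType(StringType(), True), True),
    StructField("_corrupt_record", StringType(), True)
])

df = (spark.read
      .schema(debug_schema)
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json(rdd))

# Any non-null value here is a document that did not parse as valid JSON
df.filter(df["_corrupt_record"].isNotNull()).show(truncate=False)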
Upvotes: 1