mishkin
mishkin

Reputation: 6242

Load nested json as string

I have very complex and highly nested json structure, stored as a string in a Hive table.

Is there any way to define a partial schema for Spark to describes only root elements so then I can load one of the child structures as a whole as a string?

To clarify, here is root elements of my json:

{
"meta": {....},
"entry": [{..}, {...}]
}

I do not want to declare schema for the whole thing but only for root elements meta and entry. Then I need to extract entries as an array of strings, with every entry being a separate json document.

Something like below which unfortunately does not work (tried in Spark 2.2)

schema = StructType(
        [
            StructField("meta", StringType(), True),
            StructField("entry", ArrayType(StringType(), True), True)

        ]
    )

rdd = rdd_src.map(lambda row: str(row.json_payload))
bundle = spark.read.json(rdd, schema=schema, multiLine=True)

Basically, the end goal is to get an array of strings from entry and every string will be a separate json document. My code above does not throw any error messages but resulting dataframe contains rows with blank values.

Upvotes: 0

Views: 112

Answers (1)

user9627474
user9627474

Reputation:

There is nothing particularly wrong with your approach and it should work just fine:

>>> spark.version
'2.2.1'
>>> 
>>> from pyspark.sql.types import *
>>> schema = StructType(
...         [
...             StructField("meta", StringType(), True),
...             StructField("entry", ArrayType(StringType(), True), T
... rue)
... 
...         ]
...     )
...     
>>> json = """{"meta": {"foo": "bar"}, "entry": [{"foo": "bar"}, {"bar": "foo"}]}"""
>>> 
>>> spark.read.schema(schema).json(sc.parallelize([json])).show()
+-------------+--------------------+
|         meta|               entry|
+-------------+--------------------+
|{"foo":"bar"}|[{"foo":"bar"}, {...|
+-------------+--------------------+

If you get empty values, it is likely because documents (including content of meta or entry) are not valid JSON, and don't properly parse.

Upvotes: 1

Related Questions