Reputation: 6242
I have a very complex and highly nested JSON structure, stored as a string in a Hive table.
Is there any way to define a partial schema for Spark that describes only the root elements, so that I can then load one of the child structures as a whole, as a string?
To clarify, here are the root elements of my JSON:
{
  "meta": {....},
  "entry": [{..}, {...}]
}
I do not want to declare a schema for the whole thing, only for the root elements meta and entry. Then I need to extract the entries as an array of strings, with every entry being a separate JSON document.
Something like the snippet below, which unfortunately does not work (tried in Spark 2.2):
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

schema = StructType(
    [
        StructField("meta", StringType(), True),
        StructField("entry", ArrayType(StringType(), True), True)
    ]
)

# json_payload holds the raw JSON document for each row
rdd = rdd_src.map(lambda row: str(row.json_payload))
bundle = spark.read.json(rdd, schema=schema, multiLine=True)
Basically, the end goal is to get an array of strings from entry, with every string being a separate JSON document. My code above does not throw any error messages, but the resulting DataFrame contains only rows with blank values.
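For reference, once the read works I would expect to pull the individual entries out with something like the sketch below (a minimal sketch against the bundle DataFrame from my code above; explode is from pyspark.sql.functions):

from pyspark.sql.functions import explode

# One row per element of "entry"; each row then holds a single JSON document
entries = bundle.select(explode("entry").alias("entry_json"))

# Collect into a plain Python list of JSON strings
entry_strings = [row.entry_json for row in entries.collect()]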
Upvotes: 0
Views: 112
There is nothing particularly wrong with your approach and it should work just fine:
>>> spark.version
'2.2.1'
>>>
>>> from pyspark.sql.types import *
>>> schema = StructType(
...     [
...         StructField("meta", StringType(), True),
...         StructField("entry", ArrayType(StringType(), True), True)
...     ]
... )
>>> json = """{"meta": {"foo": "bar"}, "entry": [{"foo": "bar"}, {"bar": "foo"}]}"""
>>>
>>> spark.read.schema(schema).json(sc.parallelize([json])).show()
+-------------+--------------------+
| meta| entry|
+-------------+--------------------+
|{"foo":"bar"}|[{"foo":"bar"}, {...|
+-------------+--------------------+
If you get empty values, it is most likely because the documents (including the content of meta or entry) are not valid JSON and don't parse properly.
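One way to confirm that, as a minimal sketch (assuming the rdd of JSON strings from your question; _corrupt_record is just a column name chosen here): add a corrupt-record column to the schema and point columnNameOfCorruptRecord at it, so Spark routes unparseable documents there instead of silently nulling the row.

from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# Same schema as before, plus a column that captures records Spark fails to parse
debug_schema = StructType([
    StructField("meta", StringType(), True),
    StructField("entry", ArrayType(StringType(), True), True),
    StructField("_corrupt_record", StringType(), True)
])

df = (spark.read
      .schema(debug_schema)
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json(rdd))

# Any non-null value here is a document that did not parse as valid JSON
df.filter(df["_corrupt_record"].isNotNull()).show(truncate=False)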
Upvotes: 1