user400058

Reputation: 111

Flattening recursive nested data in Spark SQL

I have JSON data that contains multiple levels of nesting, as below. The depth of nesting is not fixed.

JSON field value

{
  "hierarchy": {
    "record": {
      "id": 1,
      "record": [
        { "id": 2, "record": [
            { "id": 3, "record": [
                { "id": 4, "record": [] },
                { "id": 5, "record": [] } ] } ] },
        { "id": 6, "record": [ { "id": 7 } ] }
      ]
    }
  },
  "type": "record"
}

df = spark.read.option("multiLine", True).json("file.json")
df.printSchema()
df.show(100,False)

This is the initial schema I get. I attempted explode, but could not handle the unknown depth of nesting the data can have.

<class 'pyspark.sql.dataframe.DataFrame'>
root
 |-- hierarchy: struct (nullable = true)
 |    |-- record: struct (nullable = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- record: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- id: long (nullable = true)
 |    |    |    |    |-- record: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- id: long (nullable = true)
 |    |    |    |    |    |    |-- record: array (nullable = true)
 |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |-- id: long (nullable = true)
 |    |    |    |    |    |    |    |    |-- record: array (nullable = true)
 |    |    |    |    |    |    |    |    |    |-- element: string (containsNull = true)
 |-- type: string (nullable = true)

+--------------------------------------------------------------------------------------------------------------------------+------+
|hierarchy                                                                                                                 |type  |
+--------------------------------------------------------------------------------------------------------------------------+------+
|[[1,WrappedArray([2,WrappedArray([3,WrappedArray([4,WrappedArray()], [5,WrappedArray()])])], [6,WrappedArray([7,null])])]]|record|
+--------------------------------------------------------------------------------------------------------------------------+------+

I want this flattened so that each record becomes a row with its id and its parent's id:

record_field    id  parent_id
=============================
record          1   null
record          2   1
record          3   2

record          4   3
record          5   3

record          6   1
record          7   6

Is it possible to do this in Spark SQL (PySpark)?
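One way to produce exactly this table is to do the recursion in plain Python on the driver before handing the rows to Spark. This is a minimal sketch; `flatten_records` is a hypothetical helper name, not a Spark API:

```python
import json

# Hypothetical helper: walks the nested "record" fields recursively and
# emits one (field_name, id, parent_id) row per record, at any depth.
def flatten_records(node, parent_id=None, field="record"):
    if node is None:
        return
    # a node may be a single dict or a list of dicts
    items = node if isinstance(node, list) else [node]
    for item in items:
        yield (field, item["id"], parent_id)
        yield from flatten_records(item.get("record"), item["id"], field)

sample = """{ "hierarchy": { "record": { "id": 1, "record": [
  { "id": 2, "record": [ { "id": 3, "record": [
      { "id": 4, "record": [] }, { "id": 5, "record": [] } ] } ] },
  { "id": 6, "record": [ { "id": 7 } ] } ] } }, "type": "record" }"""

rows = list(flatten_records(json.loads(sample)["hierarchy"]["record"]))
# rows == [("record", 1, None), ("record", 2, 1), ("record", 3, 2),
#          ("record", 4, 3), ("record", 5, 3), ("record", 6, 1), ("record", 7, 6)]
```

The resulting list can then be turned into a DataFrame with something like `spark.createDataFrame(rows, ["record_field", "id", "parent_id"])`.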

Upvotes: 1

Views: 1897

Answers (1)

user400058

Reputation: 111

For those who may find it helpful: I ended up using recursion to achieve this.

import json

from pyspark.sql.functions import split
from pyspark.sql.types import StringType

def iter_cls(parentId, obj, clevel=0):
    # recursively yield (parent id, own id, name, nesting level) for every record
    if obj:
        for obj_item in obj:
            yield (parentId, obj_item.get("id"), obj_item.get("classifKey"), clevel)
            yield from iter_cls(obj_item.get("id"), obj_item.get("record"), clevel + 1)

data = json.loads(json_str)  # json_str holds the raw JSON text
cls_data = data["hierarchy"]
top_cls = cls_data["record"]
top_cls_id = top_cls["id"]

listAll = ["{0}|{1}|{2}|{3}".format(pid, id, name, clevel)
           for (pid, id, name, clevel) in iter_cls(top_cls_id, top_cls["record"])]

df2 = spark.createDataFrame(listAll, StringType())

split_col = split("value", "\\|")

df2.select(split_col.getItem(0).alias("parentid")
          ,split_col.getItem(1).alias("current_id")
          ,split_col.getItem(2).alias("classifKey")
          ,split_col.getItem(3).alias("clevel")
        ).show()
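As a quick sanity check, the recursive generator can be run on the question's sample JSON without Spark. Note two things: the sample has no `classifKey` field, so that column comes out as None, and the root row (id 1, parent null) is not emitted by the generator itself, since iteration starts from the root's children:

```python
import json

# the recursive generator, with a nesting-level counter
def iter_cls(parentId, obj, clevel=0):
    if obj:
        for obj_item in obj:
            yield (parentId, obj_item.get("id"), obj_item.get("classifKey"), clevel)
            yield from iter_cls(obj_item.get("id"), obj_item.get("record"), clevel + 1)

sample = ('{ "hierarchy": { "record": { "id": 1, "record": ['
          ' { "id": 2, "record": [ { "id": 3, "record": ['
          ' { "id": 4, "record": [] }, { "id": 5, "record": [] } ] } ] },'
          ' { "id": 6, "record": [ { "id": 7 } ] } ] } }, "type": "record" }')

top_cls = json.loads(sample)["hierarchy"]["record"]
rows = list(iter_cls(top_cls["id"], top_cls["record"]))
# parent/child pairs: (1,2) (2,3) (3,4) (3,5) (1,6) (6,7)
```

Prepend a `(None, 1, ..., ...)` row manually if the root itself should appear in the output table.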

Upvotes: 1
