I have JSON data that contains multiple levels of nesting, as shown below. The depth of nesting is not fixed.
{
  "hierarchy": {
    "record": {
      "id": 1,
      "record": [
        {
          "id": 2,
          "record": [
            {
              "id": 3,
              "record": [
                {
                  "id": 4,
                  "record": []
                },
                {
                  "id": 5,
                  "record": []
                }
              ]
            }
          ]
        },
        {
          "id": 6,
          "record": [
            {
              "id": 7
            }
          ]
        }
      ]
    }
  },
  "type": "record"
}
df = spark.read.option("multiLine", True).json("file.json")
df.printSchema()
df.show(100, False)
This is the initial schema I get. I attempted explode, but could not handle the unknown depth of nesting (a sketch of that fixed-depth approach follows the output below).
root
|-- hierarchy: struct (nullable = true)
| |-- record: struct (nullable = true)
| | |-- id: long (nullable = true)
| | |-- record: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- id: long (nullable = true)
| | | | |-- record: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- id: long (nullable = true)
| | | | | | |-- record: array (nullable = true)
| | | | | | | |-- element: struct (containsNull = true)
| | | | | | | | |-- id: long (nullable = true)
| | | | | | | | |-- record: array (nullable = true)
| | | | | | | | | |-- element: string (containsNull = true)
|-- type: string (nullable = true)
+--------------------------------------------------------------------------------------------------------------------------+------+
|hierarchy |type |
+--------------------------------------------------------------------------------------------------------------------------+------+
|[[1,WrappedArray([2,WrappedArray([3,WrappedArray([4,WrappedArray()], [5,WrappedArray()])])], [6,WrappedArray([7,null])])]]|record|
+--------------------------------------------------------------------------------------------------------------------------+------+
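For reference, a fixed-depth attempt looks roughly like this (an illustrative sketch, not my exact code; column names are taken from the schema above). Each additional level of nesting needs another hand-written explode, which is why this cannot cover arbitrary depth:
from pyspark.sql.functions import col, explode

# Each level of nesting needs its own explode, so this only ever
# reaches the depths that are spelled out explicitly.
lvl1 = df.select(col("hierarchy.record.id").alias("parent_id"),
                 explode("hierarchy.record.record").alias("child"))
lvl2 = lvl1.select(col("parent_id"),
                   col("child.id").alias("id"),
                   explode("child.record").alias("grandchild"))
lvl2.show()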
I want this flattened so that each record becomes a row together with its id and its parent's id:
record_field  id  parent_id
===========================
record        1   null
record        2   1
record        3   2
record        4   3
record        5   3
record        6   1
record        7   6
Is it possible to do this in Spark SQL (PySpark)?
For those who may find it helpful: I ended up using recursion to achieve this.
import json

from pyspark.sql.functions import split
from pyspark.sql.types import StringType

def iter_cls(parent_id, obj):
    # Recursively walk the nested "record" lists, yielding
    # (parent_id, id, classifKey) for every node at any depth.
    # The truthiness check also covers nodes without a "record" key,
    # where obj is None (e.g. id 7 in the sample).
    # classifKey exists in my real data; it is absent from the
    # simplified sample above, so it comes out as null there.
    if obj:
        for obj_item in obj:
            yield (parent_id, obj_item.get("id"), obj_item.get("classifKey"))
            yield from iter_cls(obj_item.get("id"), obj_item.get("record"))

data = json.loads(json_str)  # json_str holds the raw JSON document
cls_data = data["hierarchy"]
top_cls = cls_data["record"]

# Wrapping the top record in a list makes it come out as a row too
# (id 1 with a null parent), matching the desired output.
listAll = ("{0}|{1}|{2}".format(pid, id, name)
           for (pid, id, name) in iter_cls(None, [top_cls]))

df2 = spark.createDataFrame(listAll, StringType())
split_col = split("value", "\\|")
df2.select(split_col.getItem(0).alias("parentid"),
           split_col.getItem(1).alias("current_id"),
           split_col.getItem(2).alias("classifKey")
           ).show()
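As a side note, the pipe-join/split round trip could also be skipped by passing the tuples straight to createDataFrame with an explicit schema; a minimal sketch, reusing iter_cls and top_cls from above:
from pyspark.sql.types import LongType, StringType, StructField, StructType

# Hypothetical alternative: build the DataFrame directly from the tuples
# instead of serializing them to strings and splitting them again.
schema = StructType([
    StructField("parent_id", LongType(), True),
    StructField("current_id", LongType(), True),
    StructField("classifKey", StringType(), True),
])
df3 = spark.createDataFrame(list(iter_cls(None, [top_cls])), schema)
df3.show()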