Reputation: 447
We want to break a nested data structure into separate entities with Spark & Scala. The structure looks like this:
root
|-- timestamp: string (nullable = true)
|-- contract: struct (nullable = true)
| |-- category: string (nullable = true)
| |-- contractId: array (nullable = true)
| |-- items: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- active: boolean (nullable = true)
| | | |-- itemId: string (nullable = true)
| | | |-- subItems: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- elementId: string (nullable = true)
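For reference, a DataFrame with this schema can be reproduced from a single made-up JSON record (all values below are invented for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// One invented record matching the schema above
val sample = Seq(
  """{"timestamp":"2021-01-01T00:00:00Z","contract":{"category":"A","contractId":["c1"],"items":[{"active":true,"itemId":"i1","subItems":[{"elementId":"e1"}]}]}}"""
).toDS()

val dataframe = spark.read.json(sample)
dataframe.printSchema()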
We want to have contracts, items and subItems in separate data collections. The sub-entities should contain references to their parent, and the top-level fields (timestamp) as audit fields.
Contracts: timestamp, category, contractId (without items)
Items: timestamp, contractId (reference to the parent contract), active, itemId (without subItems)
SubItems: timestamp, itemId (reference to the parent item), elementId
We don't want to configure every attribute explicitly, but only the respective parent attribute to extract, the foreign key (reference), and what should NOT be extracted (e.g. a contract should not contain its items, an item should not contain its subItems).
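Conceptually, something like the following sketch is what we are after; extractEntity and its parameters are made-up names for illustration, not an existing API:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode}

// Sketch: explode one parent array, carry over the audit and foreign-key
// columns, and drop the child collections that become entities of their own.
def extractEntity(df: DataFrame, parent: String,
                  foreignKey: String, exclude: Seq[String]): DataFrame =
  df.select(col("timestamp"), col(foreignKey), explode(col(parent)).alias("e"))
    .select(col("timestamp"), col(foreignKey.split("\\.").last), col("e.*"))
    .drop(exclude: _*)

// e.g. val items = extractEntity(dataframe, "contract.items", "contract.contractId", Seq("subItems"))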
We tried dataframe.select("*").select(explode("contract.*"))
and the like, but we couldn't make it work.
Any ideas on how to do this elegantly are welcome.
Best, Alex
Upvotes: 0
Views: 1528
Reputation: 11
This is about flattening rows. The explode function must be applied to an array column (such as contract.items), not to a struct or a wildcard path:
dataframe.select("*")
.select(explode("contract.items.*").alias("ci_flat"))
.select("ci_flat.itemId", "ci_flat.subItems")
refs:
- Flattening Rows in Spark
- What's the difference between explode function and operator?
- How to unwind array in DataFrame (from JSON)?
Upvotes: 1