Alex

Reputation: 447

How to extract only certain attribute levels from a nested structure in a spark dataframe

We want to break a nested data structure into separate entities with Spark & Scala. The structure is like:

root
 |-- timestamp: string (nullable = true)
 |-- contract: struct (nullable = true)
 |    |-- category: string (nullable = true)
 |    |-- contractId: array (nullable = true)
 |    |-- items: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- active: boolean (nullable = true)
 |    |    |    |-- itemId: string (nullable = true)
 |    |    |    |-- subItems: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- elementId: string (nullable = true)

We want to have contracts, items and subItems in separate data collections. The sub-entities should contain references to their parent, and the top-level fields (timestamp) as audit fields.

Contracts:

Items:

SubItems:

We don't want to configure every attribute explicitly, but only the respective parent attribute to extract, the foreign key (reference), and what should NOT be extracted (e.g. a contract should not contain its items, an item should not contain its subItems).
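One way to approach this configurably is a small helper that takes the array path, the foreign-key column, and the child columns to exclude. The sketch below is illustrative, not a complete solution; the function and parameter names (`extractEntity`, `parentPath`, `foreignKey`, `dropCols`) are assumptions, and the commented usage lines assume the schema from the question:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode}

// Generic child-entity extractor (names are illustrative):
// - parentPath: the array column to explode (e.g. "contract.items")
// - foreignKey: the parent column carried along as the reference
// - dropCols:   child columns that should NOT be part of this entity,
//               because they are extracted into their own collection
def extractEntity(df: DataFrame,
                  parentPath: String,
                  foreignKey: String,
                  dropCols: Seq[String]): DataFrame = {
  df.select(col("timestamp"),                        // top-level audit field
            col(foreignKey).alias("parentId"),       // reference to the parent
            explode(col(parentPath)).alias("child")) // one row per array element
    .select("timestamp", "parentId", "child.*")      // flatten the exploded struct
    .drop(dropCols: _*)                              // strip nested children
}

// Possible usage against the schema above:
// val contracts = df.select(col("timestamp"), col("contract.*")).drop("items")
// val items     = extractEntity(df, "contract.items", "contract.contractId",
//                               Seq("subItems"))
// subItems needs a second explode (items first, then their subItems),
// carrying itemId along as the foreign key.
```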

We tried dataframe.select("*").select(explode("contract.*")) and the like, but we can't make it work. Any ideas on how to do this elegantly are welcome.

Best Alex

Upvotes: 0

Views: 1528

Answers (1)

hongmin

Reputation: 11

It's about how to flatten a row. The explode function should be used on an array column, not on a struct.

import org.apache.spark.sql.functions.{col, explode}

dataframe
    .select(explode(col("contract.items")).alias("ci_flat"))
    .select("ci_flat.itemId", "ci_flat.subItems")
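Since subItems is itself an array, getting one row per sub-item takes a second explode, keeping itemId as the reference back to the parent item. A minimal sketch, assuming the schema from the question (the val name subItems is illustrative):

```scala
import org.apache.spark.sql.functions.{col, explode}

val subItems = dataframe
  .select(col("timestamp"),                          // carry the audit field along
          explode(col("contract.items")).alias("item"))
  .select(col("timestamp"),
          col("item.itemId"),                        // foreign key to the parent item
          explode(col("item.subItems")).alias("subItem"))
  .select("timestamp", "itemId", "subItem.elementId")
```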

refs:

- Flattening Rows in Spark
- What's the difference between explode function and operator?
- How to unwind array in DataFrame (from JSON)?

Upvotes: 1
