Alex

Reputation: 447

How to extract only certain attribute levels from a nested structure in a spark dataframe

We want to break a nested data structure into separate entities with Spark & Scala. The structure is like:

root
 |-- timestamp: string (nullable = true)
 |-- contract: struct (nullable = true)
 |    |-- category: string (nullable = true)
 |    |-- contractId: array (nullable = true)
 |    |-- items: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- active: boolean (nullable = true)
 |    |    |    |-- itemId: string (nullable = true)
 |    |    |    |-- subItems: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- elementId: string (nullable = true)

We want to have contracts, items and subItems in separate data collections. The sub-entities should contain references to their parent, and the top-level fields (timestamp) as audit fields.

Contracts:

Items:

SubItems:

We don't want to configure every attribute explicitly, but only the respective parent attribute to extract, the foreign key (reference), and what should NOT be extracted (e.g. a contract should not contain its items, an item should not contain its subItems).
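One way to approach this configurably is a small helper that takes the array path, the foreign-key column, and the child columns to exclude. The sketch below is illustrative, not a complete solution; the function and parameter names (`extractEntity`, `parentPath`, `foreignKey`, `dropCols`) are assumptions, and the commented usage lines assume the schema from the question:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode}

// Generic child-entity extractor (names are illustrative):
// - parentPath: the array column to explode (e.g. "contract.items")
// - foreignKey: the parent column carried along as the reference
// - dropCols:   child columns that should NOT be part of this entity,
//               because they are extracted into their own collection
def extractEntity(df: DataFrame,
                  parentPath: String,
                  foreignKey: String,
                  dropCols: Seq[String]): DataFrame = {
  df.select(col("timestamp"),                        // top-level audit field
            col(foreignKey).alias("parentId"),       // reference to the parent
            explode(col(parentPath)).alias("child")) // one row per array element
    .select("timestamp", "parentId", "child.*")      // flatten the exploded struct
    .drop(dropCols: _*)                              // strip nested children
}

// Possible usage against the schema above:
// val contracts = df.select(col("timestamp"), col("contract.*")).drop("items")
// val items     = extractEntity(df, "contract.items", "contract.contractId",
//                               Seq("subItems"))
// subItems needs a second explode (items first, then their subItems),
// carrying itemId along as the foreign key.
```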

We tried dataframe.select("*").select(explode("contract.*")) and the like, but we can't make it work. Any ideas on how to do this elegantly are welcome.

Best Alex

Upvotes: 0

Views: 1528

Answers (1)

hongmin

Reputation: 11

It's about how to flatten a row. The explode function should be used on an array column, not on a struct.

import org.apache.spark.sql.functions.{col, explode}

dataframe
    .select(explode(col("contract.items")).alias("ci_flat"))
    .select("ci_flat.itemId", "ci_flat.subItems")
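Since subItems is itself an array, getting one row per sub-item takes a second explode, keeping itemId as the reference back to the parent item. A minimal sketch, assuming the schema from the question (the val name subItems is illustrative):

```scala
import org.apache.spark.sql.functions.{col, explode}

val subItems = dataframe
  .select(col("timestamp"),                          // carry the audit field along
          explode(col("contract.items")).alias("item"))
  .select(col("timestamp"),
          col("item.itemId"),                        // foreign key to the parent item
          explode(col("item.subItems")).alias("subItem"))
  .select("timestamp", "itemId", "subItem.elementId")
```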

refs:

- Flattening Rows in Spark
- What's the difference between explode function and operator?
- How to unwind array in DataFrame (from JSON)?

Upvotes: 1
