Reputation: 1267
I have a DataFrame of flattened order lines, one row per line, with these columns:
orderId (string), orderLine (struct)
1, {"sequence":1,"productId":11111111,"productName":"Blah","quantity":1,"unitPrice":{"net":65},"totalPrice":{"gross":67.84,"net":65,"tax":2.84}}
1, {"sequence":2,"productId":22222222,"productName":"Blah2","quantity":1,"unitPrice":{"net":100},"totalPrice":{"gross":104.38,"net":100,"tax":4.38}}
What is the most efficient way to generate a DataFrame from this with the following columns:
orderId (string), orderLines (array of orderLine struct)
Essentially, I want to group the individual line structs for a given order into an array of line items. In this example, orderLines would contain both orderLine items for order 1.
Upvotes: 0
Views: 883
Reputation: 74739
I'd use groupBy and the collect_list function as follows:
import org.apache.spark.sql.functions.collect_list
orders.groupBy("orderId").agg(collect_list("orderLine").as("orderLines"))
See Dataset (for groupBy) and the functions object (for the collect_list function).
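For anyone who wants to try this end to end, here is a minimal sketch in Scala. It assumes a local SparkSession and uses made-up sample rows with only a few of the fields from the question; the object name CollectOrderLines and the trimmed-down orderLine struct are illustrative, not taken from the original data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, struct}

object CollectOrderLines {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("collect-order-lines")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: a reduced version of the orderLine struct.
    val orders = Seq(
      ("1", 1, 11111111L, "Blah", 1),
      ("1", 2, 22222222L, "Blah2", 1)
    ).toDF("orderId", "sequence", "productId", "productName", "quantity")
      .select(
        $"orderId",
        struct($"sequence", $"productId", $"productName", $"quantity").as("orderLine")
      )

    // One row per orderId, with all of its line structs gathered into an array.
    val grouped = orders
      .groupBy("orderId")
      .agg(collect_list("orderLine").as("orderLines"))

    grouped.printSchema()
    grouped.show(truncate = false)

    spark.stop()
  }
}
```

Note that collect_list does not guarantee the order of the collected elements after a shuffle. If the line order matters, one option that should work is wrapping the result in sort_array, which compares structs field by field starting with the first one (sequence in this sketch).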
Upvotes: 3