How to collect order lines per group (using collect_list)?

Question

I have a DataFrame of flattened order lines with the following columns:

orderId (String), orderLine (struct)

1,  {"sequence":1,"productId":11111111,"productName":"Blah","quantity":1,"unitPrice":{"net":65},"totalPrice":{"gross":67.84,"net":65,"tax":2.84}}

1,  {"sequence":2,"productId":22222222,"productName":"Blah2","quantity":1,"unitPrice":{"net":100},"totalPrice":{"gross":104.38,"net":100,"tax":4.38}}

What is the most efficient way to turn this into a DataFrame with the following columns:

orderId (String), orderLines (array of orderLine structs)

In other words, I want to group the individual orderLine structs of each order into a single array of line items. In this example, orderLines would contain the 2 orderLine items for order 1.

Jacek Laskowski · Accepted Answer

I'd use groupBy with the collect_list function as follows:

import org.apache.spark.sql.functions.collect_list

orders.groupBy("orderId").agg(collect_list("orderLine").as("orderLines"))

See the Dataset API (for groupBy) and the functions object (for collect_list).
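
For anyone who wants to try this end to end, here is a minimal sketch of the same groupBy and collect_list approach run against the two sample rows from the question. The case classes OrderLine and FlatOrder, the object name CollectOrderLines, and the local SparkSession settings are my own assumptions for illustration, and the price fields are left out to keep it short.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list}

// Hypothetical, simplified stand-ins for the question's schema (price fields omitted).
case class OrderLine(sequence: Int, productId: Long, productName: String, quantity: Int)
case class FlatOrder(orderId: String, orderLine: OrderLine)

object CollectOrderLines {
  // Groups the flattened rows by orderId and gathers each order's
  // orderLine structs into a single array column named orderLines.
  def collectOrderLines(orders: DataFrame): DataFrame =
    orders.groupBy("orderId").agg(collect_list(col("orderLine")).as("orderLines"))

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed settings).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("collect-order-lines")
      .getOrCreate()
    import spark.implicits._

    // The two sample lines of order "1" from the question.
    val orders = Seq(
      FlatOrder("1", OrderLine(1, 11111111L, "Blah", 1)),
      FlatOrder("1", OrderLine(2, 22222222L, "Blah2", 1))
    ).toDF()

    val grouped = collectOrderLines(orders)
    grouped.printSchema()          // orderLines should show up as an array of structs
    grouped.show(truncate = false)

    spark.stop()
  }
}
```

One thing to keep in mind: collect_list makes no guarantee about the order of the collected elements after a shuffle, so if the lines must come back in sequence order you would need to sort the array explicitly.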
