Ajay

Reputation: 198

Spark - Maintaining order of data across columns during groupBy and collect

If I have

ID  Name     Code    Value
1   Person1  A       12
1   Person2  B       15

And I do a

df.groupBy("ID").agg(
collect_set("Name").alias("Name"),
collect_set("Code").alias("Code"),
collect_set("Value").alias("Value")
)

I might get a

1, [Person1, Person2], [B,A], [15,12]

I need to get a

1, [Person1, Person2], [A,B], [12,15]

How do I ensure the same order across all columns?

My actual df has 70 columns. I need to group by one column and pick the first 5 unique values for each column, in the correct order.

Any suggestions are deeply appreciated.

Upvotes: 1

Views: 92

Answers (2)

ForeverLearner

Reputation: 2113

Sets do not preserve order, but you can impose an order on the collected arrays by embedding some satellite data before collecting.

You could prepend the person's name to each attribute, as below:

val df2 = df.map { each =>
  val person = each.getString(1)
  (each.getInt(0), person + "|" + each.getString(1), person + "|" + each.getString(2), person + "|" + each.getInt(3))
}.toDF("ID", "Name", "Code", "Value")

Now you can apply sort_array after collect_set: every column's array gets sorted by the person's name prefix, so all columns stay aligned.

val df3 = df2.groupBy("ID").agg(
  sort_array(collect_set("Name")).alias("Name"),
  sort_array(collect_set("Code")).alias("Code"),
  sort_array(collect_set("Value")).alias("Value"))

Note that each attribute has the person info attached in the resulting list; you can strip the prefix afterwards if needed (e.g. by splitting on "|").

df3.show
+---+--------------------+--------------------+--------------------+
| ID|                Name|                Code|               Value|
+---+--------------------+--------------------+--------------------+
|  1|[Person1|Person1,...|[Person1|A, Perso...|[Person1|12, Pers...|
+---+--------------------+--------------------+--------------------+
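The mechanics of the trick can be illustrated with plain Scala collections (a sketch only, no Spark; the data mirrors the question's example):

```scala
// Rows: (ID, Name, Code, Value) -- the question's sample data, out of order.
val rows = Seq((1, "Person2", "B", 15), (1, "Person1", "A", 12))

// Prefix every attribute with the person's name (the "satellite data").
val tagged = rows.map { case (id, name, code, value) =>
  (id, s"$name|$name", s"$name|$code", s"$name|$value")
}

// Because every column now starts with the same sort key, sorting each
// collected column independently yields the same row order in all three.
val names  = tagged.map(_._2).distinct.sorted  // List(Person1|Person1, Person2|Person2)
val codes  = tagged.map(_._3).distinct.sorted  // List(Person1|A, Person2|B)
val values = tagged.map(_._4).distinct.sorted  // List(Person1|12, Person2|15)
```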

Upvotes: 0

Raphael Roth

Reputation: 27373

You cannot rely on the order within the collected sets. I would suggest packing the attributes into a struct, which gives you one array instead of three:

df.groupBy("ID").agg(
  collect_list(struct("Name","Code","Value").as("Attribute")).as("Attributes")
)
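This also helps with the "first 5 unique values in the correct order" requirement: with one array of structs you can deduplicate, sort, and slice once, and the attributes of each row stay together. The idea in plain Scala collections (a sketch, not the Spark API; in Spark the analogous calls would be collect_set, sort_array, and slice):

```scala
// Each row's attributes stay together as one tuple (the "struct").
// Duplicate row included to show deduplication.
val rows = Seq((1, "Person2", "B", 15), (1, "Person1", "A", 12), (1, "Person2", "B", 15))

// Group by ID, keep distinct attribute tuples, sort by Name, take the first 5.
val grouped = rows
  .groupBy(_._1)
  .map { case (id, rs) =>
    id -> rs.map { case (_, n, c, v) => (n, c, v) }.distinct.sortBy(_._1).take(5).toList
  }
// grouped(1) == List(("Person1", "A", 12), ("Person2", "B", 15))
```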

Upvotes: 3
