Reputation: 1
I have a dataframe like this:
Name | Model | Age | Rating |
---|---|---|---|
Bob | short | 12 | 4.5 |
Bob | big | 32 | 4 |
Dav | big | 46 | 5 |
And I'd like to produce this:
Name | Model | Age | Rating |
---|---|---|---|
Bob | ["short","big"] | 12 | 4.5 |
Dav | big | 46 | 5 |
The remaining fields should be filled with the data from the row that has the highest rating, and there should be no duplicate models.
Thank you
Upvotes: 0
Views: 592
Reputation: 652
It is difficult to know exactly what logic will fit your use case without additional information about the data.
For example:
Hopefully the code below will at least get you started down the right path.
I assumed that the values for age and rating can come from any of the input rows with the given name. The code also keeps only the unique set of model names.
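import org.apache.spark.sql.functions.{col, collect_set, first}
df.groupBy(col("name"))
  .agg(
    collect_set(col("model")).as("model"), // Unique set of model names in an array
    first(col("age")).as("age"),           // Age from an arbitrary row in the group
    first(col("rating")).as("rating")      // Rating from an arbitrary row in the group
  )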
Some other functions to look into might be: collect_list, last, min, max, etc.
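If the age and rating really do need to come from the highest-rated row, as the question asks, one possible approach is to take `max` over a struct whose first field is the rating. The following is only a sketch under assumptions: the local `SparkSession`, the sample rows (copied from the question), the lowercase column names, and the intermediate column name `best` are all illustrative and not taken from your actual job.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_set, max, struct}

// Local session for illustration only; in a real job, reuse the existing one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("highest-rating-sketch")
  .getOrCreate()
import spark.implicits._

// Sample data copied from the question (column names are an assumption).
val df = Seq(
  ("Bob", "short", 12, 4.5),
  ("Bob", "big", 32, 4.0),
  ("Dav", "big", 46, 5.0)
).toDF("name", "model", "age", "rating")

val result = df
  .groupBy(col("name"))
  .agg(
    // Unique model names per name, collected into an array.
    collect_set(col("model")).as("model"),
    // Structs are compared field by field, so with rating first, max()
    // returns the struct built from the highest-rated row of each group.
    max(struct(col("rating"), col("age"))).as("best")
  )
  .select(
    col("name"),
    col("model"),
    col("best.age").as("age"),
    col("best.rating").as("rating")
  )

result.show(truncate = false)
```

A few caveats: `model` is always an array here, so Dav gets a one-element array rather than a plain string; the order of elements produced by `collect_set` is not guaranteed; and if two rows for the same name tie on rating, the struct comparison falls through to age, so the row with the larger age wins.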
Upvotes: 1