
Reputation: 1

How to join distinct values from the same column in one row with Spark/Scala?

I have a dataframe like this:

Name	Model	Age	Rating
Bob	short	12	4.5
Bob	big	32	4
Dav	big	46	5

And I'd like to produce this:

Name	Model	Age	Rating
Bob	["short","big"]	12	4.5
Dav	big	46	5

The remaining fields should be filled with the data from the row that has the highest rating, and there should be no duplicate models.

Thank you

Upvotes: 0

Answers (1)

Erp12

Reputation: 652

It is difficult to know exactly what logic will fit your use case without additional information about the data.

For example:

Does it matter which age and rating are assigned to the name after aggregating the values for model?
Will there ever be multiple rows with the same name and model? If so, do you want the model array to have the unique set of model names or include the duplicates?

Hopefully the below code will at least get you started down the right path.

I assumed that the values for age and rating can come from any of the input rows with the given name. The code also produces only the unique set of model names.

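import org.apache.spark.sql.functions.{col, collect_set, first}

df.groupBy(col("name"))
  .agg(
    collect_set(col("model")).as("model"), // Unique set of model names in an array
    first(col("age")).as("age"),           // Age from an arbitrary row in the group
    first(col("rating")).as("rating")      // Rating from an arbitrary row in the group
  )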

Some other functions to look into might be: collect_list, last, min, max, etc.
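
If the age and rating need to come from the row with the highest rating, as the question asks, one common approach is to take the max of a struct whose first field is the rating. Spark compares structs field by field, so the winning struct carries the age from that same row. The following is only a sketch under assumptions: the object name BestRatedPerName, the helper aggregate, the local SparkSession, and the sample data are made up for illustration, and if two rows tie on rating the one with the larger age wins.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_set, max, struct}

// Hypothetical wrapper object, used here only to make the sketch runnable.
object BestRatedPerName {

  // Groups by Name, collects distinct models, and keeps Age/Rating
  // from the row with the highest Rating in each group.
  def aggregate(df: DataFrame): DataFrame =
    df.groupBy(col("Name"))
      .agg(
        collect_set(col("Model")).as("Model"),
        // Rating is the first struct field, so max() orders by Rating first
        // and uses Age only to break ties.
        max(struct(col("Rating"), col("Age"))).as("best")
      )
      .select(
        col("Name"),
        col("Model"),
        col("best.Age").as("Age"),
        col("best.Rating").as("Rating")
      )

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("best-rated-per-name")
      .getOrCreate()
    import spark.implicits._

    // Sample rows copied from the question.
    val df = Seq(
      ("Bob", "short", 12, 4.5),
      ("Bob", "big", 32, 4.0),
      ("Dav", "big", 46, 5.0)
    ).toDF("Name", "Model", "Age", "Rating")

    aggregate(df).show(false)
    spark.stop()
  }
}
```

With the sample data, Bob should end up with age 12 and rating 4.5, and Dav with age 46 and rating 5. Note that Model is always an array here (Dav gets a one-element array rather than a plain string), and collect_set does not guarantee the order of the elements.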

Upvotes: 1
