user7307258

Group a DataFrame by a specific column and combine the values into a desired format

I have a DataFrame in Spark:

+------+---------+
|sno   |ssn      |
+------+---------+
|   123|200000000|
|   789|200000002|
|   123|200000000|
|   123|200000001|
|   894|200000001|
+------+---------+

I want to group by sno (the serial number) so that the resulting DataFrame looks like this:

+------+-------------------+
|sno   |ssn                |
+------+-------------------+
|   123|200000000,200000001|
|   789|200000002          |
|   894|200000001          |
+------+-------------------+

I am new to Spark. How would I do this?

When I register the DataFrame as a temp table and run a SQL GROUP BY, I can't get the results in the format above. How do I get them?

Upvotes: 1

Views: 48

Answers (1)

Apurba Pandey

Reputation: 1076

You can use collect_set after grouping by sno. It gathers the distinct ssn values for each sno into an array. The code is below.

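import org.apache.spark.sql.functions.collect_set
import spark.implicits._ // needed for toDF; assumes a SparkSession named spark, as in spark-shell

// Create test data
val df = Seq((123, 200000000), (789, 200000002), (123, 200000000), (123, 200000001), (894, 200000001))
  .toDF("sno", "ssn")

// Group by sno and collect the distinct ssn values into an array
val df1 = df.groupBy("sno")
  .agg(collect_set("ssn").as("ssn"))

df1.show(false)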

+---+----------------------+
|sno|ssn                   |
+---+----------------------+
|123|[200000000, 200000001]|
|789|[200000002]           |
|894|[200000001]           |
+---+----------------------+
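
The output above is an array column, while the question asks for a single comma-separated string. One way to get there is to wrap the array in concat_ws. The following is only a sketch, assuming Spark 2.x or later: the local SparkSession, the app name "group-ssn", and the view name "people" are placeholders of mine and not from the question. The cast to string and the sort_array call are also my additions, so that concat_ws receives an array of strings and the order of the values is predictable. The last part shows the same aggregation through SQL on a temp view, since the question mentions that route.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_set, concat_ws, sort_array}

// Local session for illustration only; in spark-shell, getOrCreate() returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("group-ssn").getOrCreate()
import spark.implicits._

// Same test data as above
val df = Seq((123, 200000000), (789, 200000002), (123, 200000000), (123, 200000001), (894, 200000001))
  .toDF("sno", "ssn")

// collect_set keeps distinct values, sort_array fixes their order,
// and concat_ws joins them into one comma-separated string.
val joined = df.groupBy("sno")
  .agg(concat_ws(",", sort_array(collect_set(col("ssn").cast("string")))).as("ssn"))
  .orderBy("sno")

joined.show(false)

// The same aggregation in SQL against a temp view ("people" is a hypothetical name)
df.createOrReplaceTempView("people")
val joinedSql = spark.sql(
  """SELECT sno,
    |       concat_ws(',', sort_array(collect_set(CAST(ssn AS STRING)))) AS ssn
    |FROM people
    |GROUP BY sno
    |ORDER BY sno""".stripMargin)

joinedSql.show(false)
```

Both versions should produce one row per sno, with 200000000,200000001 for sno 123. If duplicates should be kept rather than dropped, collect_list can be used in place of collect_set.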

Upvotes: 2
