Lucien

Reputation: 99

Spark Java - Merge same column multiple rows

I'm using Spark with Java and I have a DataFrame like this:

+---+------+-----+
|id |color |datas|
+---+------+-----+
|1  |blue  |data1|
|1  |red   |data2|
|1  |orange|data3|
|2  |black |data4|
|2  |      |data5|
|2  |yellow|     |
|3  |white |data7|
|3  |      |data8|
+---+------+-----+

I need to modify this DataFrame to look like this:

+---+-------------------+---------------------+
|id |color              |datas                |
+---+-------------------+---------------------+
|1  |[blue, red, orange]|[data1, data2, data3]|
|2  |[black, yellow]    |[data4, data5]       |
|3  |[white]            |[data7, data8]       |
+---+-------------------+---------------------+

I want to merge the values of each column across rows that share the same 'id' into a single array per column.

I'm able to do it through a UserDefinedAggregateFunction, but I can only do it one column at a time and it takes too long to process.

Thank you

Edit: I'm using Spark 1.6.

Upvotes: 3

Views: 1250

Answers (2)

The actual function that works for me is:

dataframe.groupBy("id").agg(collect_list("color").as("color"), collect_list("date").as("date"))
dataframe.createOrReplaceTempView("dataframe")

Then create a new query in which you can use struct():

dffinal = spark.sql(s"""SELECT struct(a.color) AS colors, struct(a.date) AS dates FROM dataframe a """)

Upvotes: 0

koiralo

Reputation: 23099

You can group by "id" and then use the collect_list function to aggregate the values of each column into a list.

dataframe.groupBy("id").agg(collect_list(struct("color")).as("color"), collect_list(struct("dates")).as("dates"))
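To make the expected result concrete, here is a plain-Java sketch (no Spark dependency) of what grouping by "id" and collecting one column into a list amounts to. The class name GroupCollectSketch and the helper collectById are hypothetical names used only for illustration. The sketch assumes the blank cells in the question are nulls or empty strings and drops both. As far as I know, collect_list itself skips nulls but keeps empty strings, and Spark does not promise the element order that this sketch preserves.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupCollectSketch {

    // Hypothetical helper: groups (id, value) pairs by id and collects the
    // non-null, non-empty values of each group into a list, in encounter order.
    public static Map<Integer, List<String>> collectById(int[] ids, String[] values) {
        Map<Integer, List<String>> grouped = new LinkedHashMap<>();
        for (int i = 0; i < ids.length; i++) {
            // Create the bucket even when the value is blank, so every id appears.
            List<String> bucket = grouped.computeIfAbsent(ids[i], k -> new ArrayList<>());
            String value = values[i];
            if (value != null && !value.isEmpty()) {
                bucket.add(value);
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Same rows as the DataFrame in the question; blanks modelled as null.
        int[] ids = {1, 1, 1, 2, 2, 2, 3, 3};
        String[] colors = {"blue", "red", "orange", "black", null, "yellow", "white", null};
        String[] datas = {"data1", "data2", "data3", "data4", "data5", null, "data7", "data8"};

        // Each column is aggregated independently, like one collect_list per column.
        System.out.println(collectById(ids, colors));
        // prints {1=[blue, red, orange], 2=[black, yellow], 3=[white]}
        System.out.println(collectById(ids, datas));
        // prints {1=[data1, data2, data3], 2=[data4, data5], 3=[data7, data8]}
    }
}
```

In Spark the same per-column aggregation happens inside a single agg(...) call, so all columns are collected in one pass over the grouped data rather than one column at a time.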

Hope this helps

Upvotes: 2
