Reputation: 99
I'm using Spark with Java and I have a DataFrame like this:
+---+------+-----+
|id |color |datas|
+---+------+-----+
|1  |blue  |data1|
|1  |red   |data2|
|1  |orange|data3|
|2  |black |data4|
|2  |      |data5|
|2  |yellow|     |
|3  |white |data7|
|3  |      |data8|
+---+------+-----+
I need to transform this DataFrame so that it looks like this:
+---+-------------------+---------------------+
|id |color              |datas                |
+---+-------------------+---------------------+
|1  |[blue, red, orange]|[data1, data2, data3]|
|2  |[black, yellow]    |[data4, data5]       |
|3  |[white]            |[data7, data8]       |
+---+-------------------+---------------------+
I want to merge the values of each column across different rows into an array, grouping the rows by the 'id' column.
I'm able to do it through a UserDefinedAggregateFunction, but I can only do it one column at a time and it takes too long to process.
Thank you
Edit: I'm using Spark 1.6.
Upvotes: 3
Views: 1250
Reputation: 61
The function that actually works for me is:
val grouped = dataframe.groupBy("id").agg(collect_list("color").as("color"), collect_list("date").as("date"))
grouped.createOrReplaceTempView("dataframe")
Then create a new query where you can use struct():
val dffinal = spark.sql(s"""SELECT struct(a.color) AS colors, struct(a.date) AS dates FROM dataframe a """)
Upvotes: 0
Reputation: 23099
You can group by "id" and then use the collect_list function to get the aggregated values:
dataframe.groupBy("id").agg(collect_list(struct("color")).as("color"), collect_list(struct("dates")).as("dates"))
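To make the intended result of the group-and-collect step concrete, here is a minimal plain-Java sketch of the same semantics on the sample rows from the question. It does not use Spark at all. The names GroupById, Row, Grouped, groupById and sampleRows are hypothetical and exist only for this illustration, and treating empty strings like nulls (skipping them) is an assumption made so the output matches the table the question asks for. It may be handy as a small reference when checking what the Spark aggregation returns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only: mimics "group by id, collect each column into a list".
final class GroupById {

    // One input row; a blank cell is modelled as null or "".
    static final class Row {
        final int id;
        final String color;
        final String datas;

        Row(int id, String color, String datas) {
            this.id = id;
            this.color = color;
            this.datas = datas;
        }
    }

    // Aggregated values for a single id.
    static final class Grouped {
        final List<String> colors = new ArrayList<>();
        final List<String> datas = new ArrayList<>();
    }

    // Groups rows by id, keeping first-seen id order and row order within each id.
    static Map<Integer, Grouped> groupById(List<Row> rows) {
        Map<Integer, Grouped> result = new LinkedHashMap<>();
        for (Row row : rows) {
            Grouped g = result.computeIfAbsent(row.id, k -> new Grouped());
            if (isPresent(row.color)) {
                g.colors.add(row.color);
            }
            if (isPresent(row.datas)) {
                g.datas.add(row.datas);
            }
        }
        return result;
    }

    // Assumption: blank cells (null or empty) are left out of the collected lists.
    private static boolean isPresent(String s) {
        return s != null && !s.isEmpty();
    }

    // The sample data from the question.
    static List<Row> sampleRows() {
        return Arrays.asList(
                new Row(1, "blue", "data1"),
                new Row(1, "red", "data2"),
                new Row(1, "orange", "data3"),
                new Row(2, "black", "data4"),
                new Row(2, "", "data5"),
                new Row(2, "yellow", ""),
                new Row(3, "white", "data7"),
                new Row(3, "", "data8"));
    }

    public static void main(String[] args) {
        groupById(sampleRows()).forEach((id, g) ->
                System.out.println(id + " " + g.colors + " " + g.datas));
    }
}
```

In a real job the grouping would stay inside Spark, where the ordering within each collected list is generally not guaranteed after a shuffle, unlike in this single-threaded sketch.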
Hope this helps
Upvotes: 2