Reputation: 7886
Suppose I have the following Spark SQL data frame (i.e., org.apache.spark.sql.DataFrame):
type    individual
==================
cat     fritz
cat     felix
mouse   mickey
mouse   minnie
rabbit  bugs
duck    donald
duck    daffy
cat     sylvester
I want to transform this into a dataframe like this:
type    individuals
================================
cat     [fritz, felix, sylvester]
mouse   [mickey, minnie]
rabbit  [bugs]
duck    [donald, daffy]
I know that I have to do something like:
myDataFrame.groupBy("type").agg(???)
What is the "???"? Is it something simple, or is it something as complicated as extending UserDefinedAggregateFunction?
Upvotes: 0
Views: 749
Reputation: 22439
You can aggregate using collect_list as follows:
import org.apache.spark.sql.functions.collect_list
import spark.implicits._  // assumes a SparkSession in scope as `spark`, as in spark-shell

val df = Seq(
  ("cat", "fritz"),
  ("cat", "felix"),
  ("mouse", "mickey"),
  ("mouse", "minnie"),
  ("rabbit", "bugs"),
  ("duck", "donald"),
  ("duck", "daffy"),
  ("cat", "sylvester")
).toDF("type", "individual")

// Aggregate each group's individuals into an array column
val groupedDF = df.groupBy($"type").agg(collect_list($"individual").as("individuals"))
groupedDF.show(truncate = false)
+------+-------------------------+
|type |individuals |
+------+-------------------------+
|cat |[fritz, felix, sylvester]|
|duck |[donald, daffy] |
|rabbit|[bugs] |
|mouse |[mickey, minnie] |
+------+-------------------------+
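Note that collect_list does not guarantee any particular order of the collected elements (it depends on how the data is partitioned). If a deterministic order inside each array matters, one option is to wrap the aggregate in sort_array; a minimal sketch, reusing the df above:

import org.apache.spark.sql.functions.{collect_list, sort_array}

// sort_array sorts each collected array, making the output deterministic
val sortedDF = df.groupBy($"type")
  .agg(sort_array(collect_list($"individual")).as("individuals"))
sortedDF.show(truncate = false)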
Upvotes: 1
Reputation: 21
If you don't mind using a little HQL inside, you can go for the collect_list function: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions%28UDAF%29
For example (after registering the DataFrame as a temporary table named myDf):
sqlContext.sql("select type, collect_list(individual) as individuals from myDf group by type")
Not sure if you can access it directly in Spark, though.
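For completeness, a minimal sketch of that SQL route, assuming the df from the other answer and the Spark 2.x API (on Spark 1.x, registerTempTable on a SQLContext or HiveContext plays the same role):

// Make the DataFrame visible to SQL queries under the name "myDf"
df.createOrReplaceTempView("myDf")

// Same aggregation, expressed in SQL
val grouped = spark.sql(
  "select type, collect_list(individual) as individuals from myDf group by type"
)
grouped.show(truncate = false)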
Upvotes: 0