Paul Reiners

Reputation: 7886

Aggregating into a list

Suppose I have the following Spark SQL data frame (i.e., org.apache.spark.sql.DataFrame):

 type   individual
 =================
 cat    fritz
 cat    felix
 mouse  mickey
 mouse  minnie
 rabbit bugs
 duck   donald
 duck   daffy
 cat    sylvester

I want to transform this into a DataFrame like this:

 type   individuals
 ================================
 cat    [fritz, felix, sylvester]
 mouse  [mickey, minnie]
 rabbit [bugs]
 duck   [donald, daffy]

I know that I have to do something like:

 myDataFrame.groupBy("type").agg(???)

What is the "???"? Is it something simple? Or is it something as complicated as extending UserDefinedAggregateFunction?

Upvotes: 0

Views: 749

Answers (2)

Leo C

Reputation: 22439

You can aggregate using collect_list as follows:

import org.apache.spark.sql.functions.collect_list
import spark.implicits._  // assumes a SparkSession named `spark`, as in spark-shell; needed for toDF and the $ column syntax

val df = Seq(
  ("cat", "fritz"),
  ("cat", "felix"),
  ("mouse", "mickey"),
  ("mouse", "minnie"),
  ("rabbit", "bugs"),
  ("duck", "donald"),
  ("duck", "daffy"),
  ("cat", "sylvester")
).toDF(
  "type", "individual"
)

// Aggregate grouped individuals into arrays
val groupedDF = df.groupBy($"type").agg(collect_list($"individual").as("individuals"))

groupedDF.show(truncate=false)
+------+-------------------------+
|type  |individuals              |
+------+-------------------------+
|cat   |[fritz, felix, sylvester]|
|duck  |[donald, daffy]          |
|rabbit|[bugs]                   |
|mouse |[mickey, minnie]         |
+------+-------------------------+
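
If the arrays should not contain duplicates, collect_set (from the same org.apache.spark.sql.functions object) can be swapped in for collect_list; a minimal sketch reusing the df above:

import org.apache.spark.sql.functions.collect_set

// Same grouping, but collect_set drops duplicate individuals within each type
val dedupedDF = df.groupBy($"type").agg(collect_set($"individual").as("individuals"))

Note that neither collect_list nor collect_set guarantees element order within the resulting arrays.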

Upvotes: 1

tumav

Reputation: 21

If you don't mind using a little HQL inside, you can go for the collect_list function: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions%28UDAF%29

For example: spark.sql("select type, collect_list(individual) as individuals from myDf group by type"). Note that sql is a method on the SparkSession (not the SparkContext), and the DataFrame must first be registered as a temporary view named myDf.

Not sure if you can access it directly in Spark, though.
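
For completeness, a minimal runnable sketch of that SQL route, assuming a SparkSession named spark and the df defined in the first answer; the view name myDf is arbitrary:

// Register the DataFrame so it can be referenced from SQL
df.createOrReplaceTempView("myDf")

// collect_list is built into Spark SQL in Spark 2.x; earlier versions needed a HiveContext
val groupedDF = spark.sql(
  "select type, collect_list(individual) as individuals from myDf group by type"
)
groupedDF.show(truncate = false)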

Upvotes: 0
