Reputation: 39
I have a Spark DataFrame in Scala like:
URL Browser
A Chrome
B Chrome
C Firefox
A Chrome
A Firefox
A Opera
A Chrome
B Chrome
B Firefox
C Tor
The URL column has a wide range of values, but the Browser column has a limited set. I want to group by the URL column and, for each URL, get the per-browser counts in a list sorted in descending order, like:
URL FrequentlyUsedBrowser
A [(Chrome,3),(Firefox,1),(Opera,1)]
B [(Chrome,2),(Firefox,1)]
C [(Firefox,1),(Tor,1)]
I have been writing SQL using a window partition to get the count as one entry per browser, but I have not been able to collect those entries into a list.
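Roughly, my window attempt looks like the sketch below (column names taken from the sample above; this is the shape of the query, not the exact one):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// One count per (url, browser) pair, but still one row per pair --
// I don't see how to fold these rows into a single list per URL.
val byUrlBrowser = Window.partitionBy("url", "browser")
val counted = df
  .withColumn("cnt", count("*").over(byUrlBrowser))
  .select("url", "browser", "cnt")
  .distinct()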
This is for a Google Dataproc cluster running Spark 2.4 with Scala 2.11.
Upvotes: 1
Views: 54
Reputation: 39
I made a modification to @chlebek's answer, adding an orderBy, and it works perfectly for me now; the list is sorted as well. Thanks!!
import org.apache.spark.sql.functions._
import spark.implicits._   // enables the 'symbol column syntax

df.withColumn("num", lit(1))
  .groupBy('url, 'browser)
  .agg(sum('num).as("num"))   // count per (url, browser) pair
  .orderBy('num.desc)         // sort before collecting, so each list is built in descending count order
  .select('url, format_string("(%s)", concat_ws(",", 'browser, 'num)).as("dst"))
  .groupBy('url)
  .agg(collect_list('dst))
  .toDF("URL", "FrequentlyUsedBrowser")
  .orderBy('url)
  .show(false)
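One caveat: collect_list after a groupBy does not strictly guarantee that the pre-groupBy ordering survives the shuffle, even though it usually holds in practice. If you want the descending order guaranteed, one variant is to sort the collected elements explicitly; a sketch (note the list elements come back as structs like [3, Chrome] rather than the "(Chrome,3)" strings):

import org.apache.spark.sql.functions._
import spark.implicits._

// struct puts the count first, so sort_array orders by count;
// asc = false makes the order descending.
df.groupBy('url, 'browser)
  .count()
  .groupBy('url)
  .agg(sort_array(collect_list(struct('count, 'browser)), asc = false)
    .as("FrequentlyUsedBrowser"))
  .orderBy('url)
  .show(false)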
Upvotes: 0
Reputation: 2451
You can achieve it by using collect_list and sorting with sort_array. Note that sort_array compares the formatted "(browser,num)" strings lexicographically, so each list comes back alphabetized by browser rather than ordered by count.
import org.apache.spark.sql.functions._
import spark.implicits._   // enables the 'symbol column syntax

df.withColumn("num", lit(1))
  .groupBy('url, 'browser)
  .agg(sum('num).as("num"))              // count per (url, browser) pair
  .select('url, format_string("(%s)", concat_ws(",", 'browser, 'num)).as("dst"))
  .groupBy('url)
  .agg(sort_array(collect_list('dst)))   // sorted lexicographically
  .toDF("URL", "FrequentlyUsedBrowser")
  .orderBy('url)
  .show(false)
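To reproduce, the input frame can be built like this (assumed setup; it produces the first show() below):

import spark.implicits._

val df = Seq(
  ("A", "Chrome"), ("B", "Chrome"), ("C", "Firefox"), ("A", "Chrome"),
  ("A", "Firefox"), ("A", "Opera"), ("A", "Chrome"), ("B", "Chrome"),
  ("B", "Firefox"), ("C", "Tor")
).toDF("url", "browser")

df.show()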
+---+-------+
|url|browser|
+---+-------+
| A| Chrome|
| B| Chrome|
| C|Firefox|
| A| Chrome|
| A|Firefox|
| A| Opera|
| A| Chrome|
| B| Chrome|
| B|Firefox|
| C| Tor|
+---+-------+
+---+------------------------------------+
|URL|FrequentlyUsedBrowser |
+---+------------------------------------+
|A |[(Chrome,3), (Firefox,1), (Opera,1)]|
|B |[(Chrome,2), (Firefox,1)] |
|C |[(Firefox,1), (Tor,1)] |
+---+------------------------------------+
Upvotes: 1