Laurens

Reputation: 63

Spark SQL: find maximum and sum grouped by key

I have a Spark DataFrame with the following schema.

counts
 |-- index: string
 |-- name: string
 |-- count: long

I would like to sum the count column for each index, and also find the maximum count per index together with the name belonging to that maximum. So each row of the result would contain a unique index with SUM(count), MAX(count) and the corresponding name, as in the example below.
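For example, with some made-up rows:

index  name   count
a      alpha  3
a      beta   7
b      gamma  4

the desired result would be:

index  SUM(count)  MAX(count)  name
a      10          7           beta
b      4           4           gamma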

I select the row with the highest count for each index by running the following query.

SELECT * FROM counts 
WHERE (index, count) 
IN (
    SELECT index, MAX(count) 
    FROM counts
    GROUP BY index
)
ORDER BY count DESC

This works fine and gives me one row per index: the row with the highest count.

However, I would like to add a column that contains the sum of count grouped by index based on the original table.

I can sum the counts grouped by index with this query:

SELECT index, SUM(count)
FROM counts
GROUP BY index

But I would like a single, unified table containing the results of both queries, computed in an efficient way.
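The most direct combination I can think of is to join the two queries above on index (just a sketch; the max_count and total_count aliases are names I made up), but that scans the table multiple times and does not feel efficient:

SELECT m.index, m.name, m.count AS max_count, s.total_count
FROM (
    SELECT * FROM counts
    WHERE (index, count) IN (
        SELECT index, MAX(count)
        FROM counts
        GROUP BY index
    )
) m
JOIN (
    SELECT index, SUM(count) AS total_count
    FROM counts
    GROUP BY index
) s ON m.index = s.index
ORDER BY max_count DESC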

Any suggestions on this matter are much appreciated.

Upvotes: 0

Views: 1554

Answers (1)

Gordon Linoff

Reputation: 1269753

Just use window functions:

SELECT c.*
FROM (SELECT c.*,
             -- total count for the whole index
             SUM(count) OVER (PARTITION BY index) as cnt,
             -- 1 = the row with the highest count within the index
             ROW_NUMBER() OVER (PARTITION BY index ORDER BY count DESC) as seqnum
      FROM counts c
     ) c
WHERE seqnum = 1;
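If you want exactly the columns asked for in the question, project them in the outer query; something like this (the max_count and total_count aliases are just illustrative names):

SELECT c.index, c.name, c.count AS max_count, c.cnt AS total_count
FROM (SELECT c.*, SUM(count) OVER (PARTITION BY index) as cnt,
             ROW_NUMBER() OVER (PARTITION BY index ORDER BY count DESC) as seqnum
      FROM counts c
     ) c
WHERE seqnum = 1;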

These queries assume you have other columns you want to keep. If you just want the max and the sum for each index, without the corresponding name:

SELECT index, MAX(count), SUM(count)
FROM counts
GROUP BY index;

And index is a bad name for a column, because it is a SQL keyword (think "create index").
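If you do keep the name, you can quote the identifier where needed; Spark SQL uses backticks for that:

SELECT `index`, SUM(count)
FROM counts
GROUP BY `index`;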

Upvotes: 1
