Reputation: 63
I have a Spark DataFrame with the following schema.
counts
|-- index: string
|-- name: string
|-- count: long
I would like to sum the count column for each index, and also find the maximum count together with its corresponding name. So each row would contain a unique index with SUM(count), MAX(count), and the corresponding name.
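For concreteness, here is a minimal PySpark sketch that builds such a table; the sample values are made up for illustration and are not from the question.

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data matching the schema above (count becomes long automatically).
df = spark.createDataFrame([
    Row(index="a", name="alice", count=10),
    Row(index="a", name="bob",   count=30),
    Row(index="b", name="carol", count=25),
])
df.createOrReplaceTempView("counts")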
I group the entries by index and select the row with the highest count by running the following query.
SELECT * FROM counts
WHERE (index, count) IN (
    SELECT index, MAX(count)
    FROM counts
    GROUP BY index
)
ORDER BY count DESC
This works fine and gives me one row per index, namely the row with the highest count.
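For reference, running that SQL from PySpark against the registered view might look like this; the output noted in the comment follows from the toy data above.

# Run the question's query through spark.sql; "counts" is the temp view
# registered earlier.
top_rows = spark.sql("""
    SELECT * FROM counts
    WHERE (index, count) IN (
        SELECT index, MAX(count)
        FROM counts
        GROUP BY index
    )
    ORDER BY count DESC
""")
top_rows.show()
# On the sample data this keeps (a, bob, 30) and (b, carol, 25).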
However, I would like to add a column that contains the sum of count grouped by index, computed over the original table.
I can sum the counts and group them by index with this query:
SELECT index, SUM(count)
FROM counts
GROUP BY index
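The same aggregation can also be done through the DataFrame API; this is a sketch, and the alias sum_count is just illustrative.

from pyspark.sql import functions as F

# Equivalent of the SQL above: one row per index with the summed count.
sums = df.groupBy("index").agg(F.sum("count").alias("sum_count"))
sums.show()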
But I would like a single, unified table containing the results of both queries, produced in an efficient way.
Any suggestions on this matter are much appreciated.
Upvotes: 0
Views: 1554
Reputation: 1269753
Just use window functions:
SELECT c.*
FROM (SELECT c.*,
             SUM(count) OVER (PARTITION BY index) as cnt,
             ROW_NUMBER() OVER (PARTITION BY index ORDER BY count DESC) as seqnum
      FROM counts c
     ) c
WHERE seqnum = 1;
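If you prefer to stay in the DataFrame API, here is a sketch of the same window logic, assuming the DataFrame is named df as in the setup above.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

by_index = Window.partitionBy("index")

result = (
    df
    # Sum of count over all rows sharing the same index.
    .withColumn("cnt", F.sum("count").over(by_index))
    # Rank rows within each index by descending count.
    .withColumn("seqnum",
                F.row_number().over(by_index.orderBy(F.col("count").desc())))
    # Keep only the top row per index.
    .where(F.col("seqnum") == 1)
    .drop("seqnum")
)
result.show()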
The query above assumes you have other columns you want to keep. If you just want the max and the sum for each index:
SELECT index, MAX(count), SUM(count)
FROM counts
GROUP BY index;
And index is a bad name for a column, because it is a SQL keyword (think "create index").
Upvotes: 1