Reputation: 1975
How can I group a dataset of sequences by the first value of each sequence in SQL?
For example, I have the following dataset
id name key metric
1 alice a 0 <- key = 'a', start of a sequence
2 alice b 1
3 alice b 1
-----------------
4 alice a 1 <- key = 'a', start of a sequence
5 alice b 0
6 alice b 0
7 alice b 0
-----------------
8 bob a 1 <- key = 'a', start of a sequence
9 bob b 1
-----------------
10 bob a 0 <- key = 'a', start of a sequence
Rows with key = 'a' start a new group. For example, I want to sum the metrics of all subsequent rows until I reach another key = 'a' or another name.
The dataset is sorted by id.
The final result should be this:
id name metric
1 alice 2
4 alice 1
8 bob 2
10 bob 0
Here's the equivalent operation in JavaScript, but I want to be able to get the same result by a SQL query.
data.reduce((acc, a) => {
  if (a.key === 'a') {
    // key = 'a' starts a new group
    return [{id: a.id, name: a.name, metric: a.metric}].concat(acc)
  } else {
    // because the data is sorted,
    // all the subsequent rows with key = 'b' belong to the latest group
    const [head, ...tail] = acc
    const head_updated = {...head, metric: head.metric + a.metric}
    return [head_updated, ...tail]
  }
}, [])
.reverse()
Sample SQL dataset:
with dataset as (
select
1 as id
, 'alice' as name
, 'a' as key
, 0 as metric
union select
2 as id
, 'alice' as name
, 'b' as key
, 1 as metric
union select
3 as id
, 'alice' as name
, 'b' as key
, 1 as metric
union select
4 as id
, 'alice' as name
, 'a' as key
, 1 as metric
union select
5 as id
, 'alice' as name
, 'b' as key
, 0 as metric
union select
6 as id
, 'alice' as name
, 'b' as key
, 0 as metric
union select
7 as id
, 'alice' as name
, 'b' as key
, 0 as metric
union select
8 as id
, 'bob' as name
, 'a' as key
, 1 as metric
union select
9 as id
, 'bob' as name
, 'b' as key
, 1 as metric
union select
10 as id
, 'bob' as name
, 'a' as key
, 0 as metric
)
select * from dataset
order by name, id
Upvotes: 2
Views: 237
Reputation: 1372
Based on what OP wrote in the comments, the query must indeed be like this:
SELECT MAX(t.head_id) AS id,
       t.head_name AS name,
       SUM(t.metric) AS metric
FROM (
    SELECT SUM(CASE WHEN key = 'a' THEN 1 END) OVER (PARTITION BY name ORDER BY id) AS group_id,
           CASE WHEN key = 'a' THEN id END AS head_id,
           name AS head_name,
           metric
    FROM dataset
) t
GROUP BY t.head_name, t.group_id
However, if you can add an index on (name, id), the query performs considerably better, because the planner can read the rows in index order and no longer needs a sort step before the window function.
Tested on a table with one million rows, this is the EXPLAIN ANALYZE output without the index:
HashAggregate (cost=177154.34..177158.34 rows=400 width=25) (actual time=3374.878..3489.755 rows=400000 loops=1)
Group Key: dataset.name, sum(CASE WHEN (dataset.key = 'a'::text) THEN 1 ELSE NULL::integer END) OVER (?)
-> WindowAgg (cost=132154.34..157154.34 rows=1000000 width=25) (actual time=1920.338..3000.218 rows=1000000 loops=1)
-> Sort (cost=132154.34..134654.34 rows=1000000 width=15) (actual time=1920.323..2232.936 rows=1000000 loops=1)
Sort Key: dataset.name, dataset.id
Sort Method: external merge Disk: 28192kB
-> Seq Scan on dataset (cost=0.00..15406.00 rows=1000000 width=15) (actual time=0.020..172.746 rows=1000000 loops=1)
Planning Time: 0.870 ms
Execution Time: 3516.726 ms
By creating the index, the query plan changes to the following:
Index:
CREATE INDEX dataset__name_id__idx ON dataset(name, id);
Query Plan:
HashAggregate (cost=90169.90..90173.90 rows=400 width=25) (actual time=1464.759..1567.778 rows=400000 loops=1)
Group Key: dataset.name, sum(CASE WHEN (dataset.key = 'a'::text) THEN 1 ELSE NULL::integer END) OVER (?)
-> WindowAgg (cost=0.42..70169.90 rows=1000000 width=25) (actual time=0.033..1077.362 rows=1000000 loops=1)
-> Index Scan using dataset__name_id__idx on dataset (cost=0.42..47669.90 rows=1000000 width=15) (actual time=0.022..225.445 rows=1000000 loops=1)
Planning Time: 0.131 ms
Execution Time: 1590.040 ms
Based on your JavaScript code, you don't want to partition the window by name, nor group by name in the outer query. Without that, you actually end up with a better query that needs only the primary key index, assuming that the id column is indexed.
SELECT t.head_id AS id,
       MAX(t.head_name) AS name,
       SUM(t.metric) AS metric
FROM (
    SELECT MAX(CASE WHEN key = 'a' THEN id END) OVER (ORDER BY id) AS head_id,
           CASE WHEN key = 'a' THEN name END AS head_name,
           metric
    FROM dataset
) t
GROUP BY t.head_id
Here is the query plan for a dataset table with 1 million rows:
HashAggregate (cost=68889.43..68891.43 rows=200 width=44) (actual time=1277.469..1393.709 rows=400000 loops=1)
Group Key: max(CASE WHEN (dataset.key = 'a'::text) THEN dataset.id ELSE NULL::integer END) OVER (?)
-> WindowAgg (cost=0.42..51389.43 rows=1000000 width=44) (actual time=0.025..927.595 rows=1000000 loops=1)
-> Index Scan using dataset_pkey on dataset (cost=0.42..31389.42 rows=1000000 width=15) (actual time=0.017..209.657 rows=1000000 loops=1)
Planning Time: 0.127 ms
Execution Time: 1411.975 ms
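To make the running MAX in the unpartitioned query more concrete, here is a minimal JavaScript sketch of the same idea: every row is tagged with the id of the most recent key = 'a' row, and metrics are summed per tag. The names sampleRows and groupByRunningHead are hypothetical and only serve the illustration; the rows copy the question's sample data and are assumed to be sorted by id already.

```javascript
// Hypothetical copy of the question's sample rows, already sorted by id.
const sampleRows = [
  { id: 1, name: 'alice', key: 'a', metric: 0 },
  { id: 2, name: 'alice', key: 'b', metric: 1 },
  { id: 3, name: 'alice', key: 'b', metric: 1 },
  { id: 4, name: 'alice', key: 'a', metric: 1 },
  { id: 5, name: 'alice', key: 'b', metric: 0 },
  { id: 6, name: 'alice', key: 'b', metric: 0 },
  { id: 7, name: 'alice', key: 'b', metric: 0 },
  { id: 8, name: 'bob', key: 'a', metric: 1 },
  { id: 9, name: 'bob', key: 'b', metric: 1 },
  { id: 10, name: 'bob', key: 'a', metric: 0 },
];

// Rough emulation of
//   MAX(CASE WHEN key = 'a' THEN id END) OVER (ORDER BY id) AS head_id
// followed by GROUP BY head_id. Because ids ascend, the running maximum
// is simply the id of the latest key = 'a' row seen so far.
function groupByRunningHead(sortedRows) {
  const groups = new Map(); // head_id -> { id, name, metric }
  let headId = null;
  for (const row of sortedRows) {
    if (row.key === 'a') {
      headId = row.id;
      groups.set(headId, { id: row.id, name: row.name, metric: 0 });
    }
    // Rows before the first key = 'a' row have no head. SQL would put them
    // in a NULL group; this sketch skips them instead.
    if (headId === null) continue;
    groups.get(headId).metric += row.metric;
  }
  return [...groups.values()];
}

console.log(groupByRunningHead(sampleRows));
```

On the sample rows this produces the four groups from the question (ids 1, 4, 8 and 10). Note that nothing here looks at name when forming groups, which mirrors the query: a change of name only starts a new group if that row also has key = 'a'.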
Upvotes: 1
Reputation: 164174
You can use the window function sum() to create the groups and then aggregate:
select min(id) id, name, sum(metric) metric
from (
  select *, sum((key = 'a')::int) over (partition by name order by id) grp
  from dataset
) t
group by name, grp
order by id
See the demo.
Results:
> id | name | metric
> -: | :---- | -----:
> 1 | alice | 2
> 4 | alice | 1
> 8 | bob | 2
> 10 | bob | 0
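For readers who find the windowed sum() hard to picture, the following JavaScript sketch mimics it step by step: a running count of key = 'a' rows is kept per name (the partition), that count serves as the group number, and rows are then aggregated per (name, grp). The names dataset and groupByRunningCount are hypothetical, and the rows are a copy of the question's sample data; this is an illustration of the technique, not how PostgreSQL executes the query.

```javascript
// Hypothetical copy of the question's sample rows.
const dataset = [
  { id: 1, name: 'alice', key: 'a', metric: 0 },
  { id: 2, name: 'alice', key: 'b', metric: 1 },
  { id: 3, name: 'alice', key: 'b', metric: 1 },
  { id: 4, name: 'alice', key: 'a', metric: 1 },
  { id: 5, name: 'alice', key: 'b', metric: 0 },
  { id: 6, name: 'alice', key: 'b', metric: 0 },
  { id: 7, name: 'alice', key: 'b', metric: 0 },
  { id: 8, name: 'bob', key: 'a', metric: 1 },
  { id: 9, name: 'bob', key: 'b', metric: 1 },
  { id: 10, name: 'bob', key: 'a', metric: 0 },
];

// Rough emulation of
//   sum((key = 'a')::int) over (partition by name order by id) grp
// followed by: group by name, grp / min(id), sum(metric) / order by id.
function groupByRunningCount(rows) {
  const sorted = [...rows].sort((x, y) => x.id - y.id); // order by id
  const counters = new Map(); // name -> running count of key = 'a' rows
  const groups = new Map();   // [name, grp] as JSON -> { id, name, metric }
  for (const row of sorted) {
    // The running count only increases on key = 'a' rows, so all rows
    // up to the next key = 'a' row share the same grp value.
    const grp = (counters.get(row.name) ?? 0) + (row.key === 'a' ? 1 : 0);
    counters.set(row.name, grp);
    const groupKey = JSON.stringify([row.name, grp]);
    const group = groups.get(groupKey);
    if (group === undefined) {
      groups.set(groupKey, { id: row.id, name: row.name, metric: row.metric });
    } else {
      group.id = Math.min(group.id, row.id); // min(id)
      group.metric += row.metric;            // sum(metric)
    }
  }
  return [...groups.values()].sort((x, y) => x.id - y.id);
}

console.log(groupByRunningCount(dataset));
```

Run on the sample rows, this returns the same four rows as the results table above. Keeping a separate counter per name is what the partition by name clause contributes: a new name always restarts the numbering, even if its first row does not have key = 'a'.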
Upvotes: 2