Nuby
Nuby

Reputation: 2386

Any way to clean up this complex aggregate function?

I'm using Postgres, but I'm fine with a generic solution to this (and you can ignore the text aggregation if you want).

With the following table setup:

id  ref_id  user_id  start_time  duration text
100 2000    1        15000       200      hello
101 2000    1        16000       300      world
102 2000    1        22000       400      foo
103 2000    2        17000       500      bar
104 2000    2        21000       600      baz

I'm trying to use the minimum integer timestamp (ms or whatever) and use that as the basis for windowing my query. The idea being that the absolute minimum across all users it the window for the conversation. So if i looked at 5000-unit windows, I'd have 2 blocks (15000-20000, 20001-25000).

Expected output:

ref_id  user_id  block  sum(duration) count(*) text
2000    1        1      500           2        hello world
2000    1        2      400           1        foo
2000    2        1      500           1        bar
2000    2        2      600           1        baz

I've gone through multiple iterations of self join and window, but I can't get it more compact than three nested queries.

select user_id, block, string_agg(text, ' '), sum(duration)
from 
    (select user_id, FLOOR((start_time - t1.st)/5000) as block, start_time, text, duration
    from table t0
    inner join 
        (select id, min(start_time) as st from table group by 1) as t1
        on t0.ref_id = t1.ref_id
    order by 1, 2, 3) t2
group by 1, 2;

I think the reason I need the third is because I couldn't order by start_time, and it matters for some of aggregations (text concatenation, for example).

I'd love some help if anyone's an expert!

Upvotes: 1

Views: 110

Answers (2)

Haleemur Ali
Haleemur Ali

Reputation: 28243

Assuming the field called id in the example sql is a typo and should be user_id as specified in the example data

The simplest way to write this query would be:

SELECT
  user_id
, FLOOR((start_time - mst)/5000) + 1 block
, SUM(duration)
, STRING_AGG("text", ' ')
FROM mytable, (SELECT MIN(start_time) mst FROM mytable) minst
GROUP BY 1, 2

You don't need any window functions or nested queries. Here, the , separating the two relations in the FROM clause performs a CROSS JOIN (or Cartesian Product) implicitly. This is how most people start writing joins in SQL queries.

However, using , to implicitly mean CROSS JOIN is often perceived as bad style, so instead the above query can be written as:

SELECT
  user_id
, FLOOR((start_time - mst)/5000) + 1 block
, SUM(duration)
, STRING_AGG("text", ' ')
FROM mytable CROSS JOIN (SELECT MIN(start_time) mst FROM mytable) minst
GROUP BY 1, 2

Also, from your example sql attempt and desired output, it seems that you'll have to add 1 to FLOOR((start_time - minst)/5000) to get the blocks to start from 1, 2, ...

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1269703

I'm not surer why that "block" is so important, but you can remove one level of subqueries:

select user_id, FLOOR((start_time - minstart)/5000),
       string_agg(text, ' '), sum(duration)
from (select user_id,  as block, 
             start_time, text, duration,
             min(start_time) over (partition by id) as minstart
      from table t0
      ) t
group by 1, 2;

Upvotes: 1

Related Questions