gatoatigrado

Reputation: 16860

psql/redshift: is there a way to use window functions like FIRST_VALUE in a GROUP BY expression?

motivation: This seems kind of terrible, but I'm trying to write string_agg in Redshift using multiple queries, each of which coalesces neighboring rows. My maximum group size isn't that big, so I think the query would only need to run for a few iterations. I've managed to preprocess my data into a form like this:

key   | merge index | value
a     | 0           | foo
a     | 0           | bar
a     | 1           | baz
b     | 0           | fandangle

In one step, everything with the same (key, merge_index) should be concatenated, so we get:

key   | merge index | value
a     | 0           | foo, bar
a     | 1           | baz
b     | 0           | fandangle

I want to use FIRST_VALUE and NTH_VALUE in a GROUP BY statement, like so:

SELECT key,
       merge_index,
       FIRST_VALUE(value) || COALESCE((', ' || NTH_VALUE(value, 2)), '')
  FROM mytable
 GROUP BY key, merge_index;

but, of course, you can't do that because FIRST_VALUE and NTH_VALUE are window functions, not aggregate functions.

question: Why can't I use FIRST_VALUE and friends in a GROUP BY group?

note: It works functionally to do a SELECT DISTINCT, omit the GROUP BY, and use the relevant OVER (PARTITION BY key, merge_index) windows, but I can't imagine this is efficient if it has to deduplicate the entire result table. I also realize I could do more preprocessing, add a column like left_or_right indicating which side each row is merged from, and then use a left join. That also doesn't seem too efficient, but maybe it's not bad.
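Roughly, the SELECT DISTINCT version I mean looks like the sketch below. The table name mytable and the ORDER BY value are just placeholders for the example (my real data has a proper ordering column), and the explicit frame is there so NTH_VALUE can see the whole group instead of only the rows up to the current one:

-- Sketch of the SELECT DISTINCT workaround; mytable/key/merge_index/value
-- follow the sample data above, ORDER BY value is only for determinism.
SELECT DISTINCT
       key,
       merge_index,
       FIRST_VALUE(value) OVER (PARTITION BY key, merge_index ORDER BY value
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
       || COALESCE(', ' || NTH_VALUE(value, 2) OVER (PARTITION BY key, merge_index ORDER BY value
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), '') AS value
  FROM mytable;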

Upvotes: 4

Views: 4846

Answers (2)

Aaron Bannin

Reputation: 127

I like David's queries, but he didn't get into why:

Window functions are evaluated near the end of the query, after grouping and aggregation. Because of this, a window function always outputs one value per row of the final data set. You can use aggregates inside window functions, but not window functions inside aggregates. To achieve your goal, you need another pass over the data set to aggregate, which is accomplished with a subquery.
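As a rough sketch of that pattern (reusing the mytable naming from the question, not David's exact queries): compute the window functions in an inner query, then GROUP BY over their output in the outer query.

-- Window functions run in the inner query; the outer query is then free
-- to aggregate over their output. Names (mytable, key, merge_index, value)
-- come from the question's sample data.
SELECT key,
       merge_index,
       MAX(first_val) || COALESCE(', ' || MAX(second_val), '') AS value
  FROM (SELECT key,
               merge_index,
               FIRST_VALUE(value) OVER (PARTITION BY key, merge_index ORDER BY value
                   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS first_val,
               NTH_VALUE(value, 2) OVER (PARTITION BY key, merge_index ORDER BY value
                   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS second_val
          FROM mytable) w
 GROUP BY key, merge_index;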

Upvotes: 3

David Faber

Reputation: 12485

Have you tried something like the following? This way you can avoid FIRST_VALUE() and NTH_VALUE() as well as aggregation:

WITH p AS (
  SELECT key, merge, value
       , ROW_NUMBER() OVER ( PARTITION BY key, merge ) AS rn
    FROM mytable
)
SELECT p1.key, p1.merge, p1.value || COALESCE(',' || p2.value, '')
  FROM p p1 LEFT JOIN p p2
    ON p1.key = p2.key
   AND p1.merge = p2.merge
   AND p2.rn = 2
 WHERE p1.rn = 1

Please see SQL Fiddle demo here. Yes, I did use Postgres 9 for the fiddle; I couldn't get a connection on 8 (but I don't think I'm using any features of 9).

Alternatively, you might use the following and avoid a self-join:

WITH p AS (
  SELECT key, merge, value
       , LEAD(value) OVER ( PARTITION BY key, merge ) AS next_value
       , ROW_NUMBER() OVER ( PARTITION BY key, merge ) AS rn
    FROM mytable
)
SELECT key, merge, value || COALESCE(',' || next_value, '')
  FROM p
 WHERE rn = 1

SQL Fiddle here. If you knew in advance how many values you needed to concatenate, you could make multiple calls to LEAD() with increasing offset values (more SQL Fiddle):

WITH p AS (
  SELECT key, merge, value
       , LEAD(value) OVER ( PARTITION BY key, merge ) AS next_value
       , LEAD(value,2) OVER ( PARTITION BY key, merge ) AS n2_value
       , LEAD(value,3) OVER ( PARTITION BY key, merge ) AS n3_value
       , ROW_NUMBER() OVER ( PARTITION BY key, merge ) AS rn
    FROM mytable
)
SELECT key, merge, value
       || COALESCE(',' || next_value, '')
       || COALESCE(',' || n2_value, '')
       || COALESCE(',' || n3_value, '')
  FROM p
 WHERE rn = 1

Upvotes: 2
