Redshift: Getting rank of a row, filtered by a condition

Question

Every time I add a row to a table, I want to know where it ranks in comparison with the table up to that point. This is easily done with the RANK() window function. However, I'm struggling to find a way to to discover where it ranks in comparison with the table up until that point filtered by a value.

As an example, I'm wanting to end up with this highly contrived table:

    date    |   name  | animal_bought | num_sloths_bought_before | num_camels_bought_before
------------+---------+---------------+--------------------------+--------------------------
 2014-09-01 | Vincent | sloth         | 0                        | 0
 2014-09-01 | Luis    | camel         | 0                        | 0
 2014-09-02 | Vincent | sloth         | 1                        | 0
 2014-09-02 | Luis    | camel         | 0                        | 1
 2014-09-02 | Kevin   | sloth         | 0                        | 0
 2014-09-03 | Vincent | camel         | 1                        | 0
 2014-09-04 | Deo     | camel         | 0                        | 0
 2014-09-04 | Vincent | sloth         | 2                        | 1
 2014-09-05 | Luis    | camel         | 0                        | 2
 2014-09-05 | Andrew  | sloth         | 0                        | 0

I was initially looking to see whether I could apply a filter to the window function (eg. RANK() OVER(PARTITION BY name WHERE animal_bought = 'sloth' ORDER BY date ASC) AS num_sloths_bought_before) but this isn't syntactically correct. I then tried adding a sub-query, as follows:

SELECT
  date,
  name,
  animal_bought,
  ( SELECT
      RANK() OVER(PARTITION BY name ORDER BY date ASC) - 1
    FROM this_table
    WHERE animal_bought = 'sloth'
  ) AS num_sloths_bought_before
FROM source_table

but Redshift threw this error:

ERROR:  This type of correlated subquery pattern is not supported yet

I've also tried putting the window function in a case statement (throws the same error) and calculating the ranks in a join query (not been able to make it work).

Gordon Linoff · Accepted Answer

Hmmm. I don't think this query would do what you want anyway:

SELECT date, name, animal_bought,
       (SELECT RANK() OVER(PARTITION BY name ORDER BY date ASC) - 1
        FROM this_table
        WHERE animal_bought = 'sloth'
       ) AS num_sloths_bought_before
FROM source_table

For a few reasons:

The use of rank() suggests that there is more than one row in this_table that matches animal_bought. Otherwise, you could use an aggregation function.
If there is only one row that matches the where clause, then the value is always 1, because the where clause is processed before the rank().
Your question only mentions one table but your query has two

Perhaps you just want rank() without a subquery?

SELECT date, name, animal_bought,
       RANK() OVER (PARTITION BY name, animal ORDER BY date ASC) - 1 as NumberBoughtBefore
FROM source_table;

If you want it for both animals, then don't use rank(), use cumulative sum:

SELECT date, name, animal_bought,
       sum(case when animal = 'sloth' then 1 else 0 end) over (partition by name order by date) as SlothsBefore,
       sum(case when animal = 'camel' then 1 else 0 end) over (partition by name order by date) as CamelsBefore
FROM source_table;

EDIT:

SELECT date, name, animal_bought,
       (sum(case when animal = 'sloth' then 1 else 0 end) over (partition by name order by date) -
        (case when animal = 'sloth' then 1 else 0 end)
       ) as SlothsBefore,
       (sum(case when animal = 'camel' then 1 else 0 end) over (partition by name order by date) -
        (case when animal = 'camel' then 1 else 0 end)
       ) as CamelsBefore
FROM source_table;

Redshift: Getting rank of a row, filtered by a condition

Answers (1)

Related Questions