Scott Borden
Scott Borden

Reputation: 187

In Redshift, is there a way to filter for records nearest to a given value?

I'm using the PERCENT_RANK() function to get percentile metrics for a given set of data. Here's the query:

WITH time_values AS (
    SELECT
        var,
        (end_time - start_time) * 1.0 / 3600000000 AS num_hours,
        PERCENT_RANK() OVER (PARTITION BY var1 ORDER BY num_hours) AS pct_rank
    FROM table
    WHERE
        start_time >= 1493596800000000
        AND end_time < 1493683200000000
)
SELECT
    var,
    pct_rank,
    num_hours
FROM time_values
WHERE pct_rank IN (0.25, 0.5, 0.8, 0.99)
ORDER BY 1, 2;

However, given the way PERCENT_RANK() works, I'm not going to get an exact match for each of the percentiles I care about, so the output will look something like:

 var | pct_rank |    num_hours
-----+----------+------------------
   a |     0.25 |  31.752826672222
   a |      0.5 | 171.844016125555
   b |     0.25 | 230.704589953055
   b |      0.5 | 246.269648327222

I'm looking for a way to return values for each percentile I care about, or the value nearest that percentile if an exact match isn't found. Is this doable?

Upvotes: 1

Views: 808

Answers (1)

AlexYes
AlexYes

Reputation: 4208

You can rank records and then select max value with the rank before percentile cutoff:

WITH time_values AS (
    SELECT
        var,
        (end_time - start_time) * 1.0 / 3600000000 AS num_hours,
        row_number() OVER (PARTITION BY var1 ORDER BY num_hours) AS rank,
        count(1) OVER (PARTITION BY var1) AS records
    FROM table
    WHERE
        start_time >= 1493596800000000
        AND end_time < 1493683200000000
)
SELECT
    var,
    max(case when 1.0*rank/count<0.25 then num_hours end) as percentile_25,
    max(case when 1.0*rank/count<0.50 then num_hours end) as percentile_50,
    max(case when 1.0*rank/count<0.80 then num_hours end) as percentile_80,
    max(case when 1.0*rank/count<0.99 then num_hours end) as percentile_99
FROM time_values
ORDER BY 1;

or do the same with PERCENT_RANK() output, and if you really want to have the output row-wise not column-wise then just union the last step results to get the desired structure

Upvotes: 2

Related Questions