Reputation: 187
I'm using the PERCENT_RANK() function to get percentile metrics for a given set of data. Here's the query:
WITH time_values AS (
SELECT
var,
(end_time - start_time) * 1.0 / 3600000000 AS num_hours,
PERCENT_RANK() OVER (PARTITION BY var1 ORDER BY num_hours) AS pct_rank
FROM table
WHERE
start_time >= 1493596800000000
AND end_time < 1493683200000000
)
SELECT
var,
pct_rank,
num_hours
FROM time_values
WHERE pct_rank IN (0.25, 0.5, 0.8, 0.99)
ORDER BY 1, 2;
However, given the way PERCENT_RANK() works, I'm not going to get an exact match for each of the percentiles I care about, so the output will look something like:
var | pct_rank | num_hours
-----+----------+------------------
a | 0.25 | 31.752826672222
a | 0.5 | 171.844016125555
b | 0.25 | 230.704589953055
b | 0.5 | 246.269648327222
I'm looking for a way to return values for each percentile I care about, or the value nearest that percentile if an exact match isn't found. Is this doable?
Upvotes: 1
Views: 808
Reputation: 4208
You can rank records and then select max value with the rank before percentile cutoff:
WITH time_values AS (
SELECT
var,
(end_time - start_time) * 1.0 / 3600000000 AS num_hours,
row_number() OVER (PARTITION BY var1 ORDER BY num_hours) AS rank,
count(1) OVER (PARTITION BY var1) AS records
FROM table
WHERE
start_time >= 1493596800000000
AND end_time < 1493683200000000
)
SELECT
var,
max(case when 1.0*rank/count<0.25 then num_hours end) as percentile_25,
max(case when 1.0*rank/count<0.50 then num_hours end) as percentile_50,
max(case when 1.0*rank/count<0.80 then num_hours end) as percentile_80,
max(case when 1.0*rank/count<0.99 then num_hours end) as percentile_99
FROM time_values
ORDER BY 1;
or do the same with PERCENT_RANK()
output, and if you really want to have the output row-wise not column-wise then just union the last step results to get the desired structure
Upvotes: 2