Reputation: 10441
I previously posted this related question. The solution to this question works for the analytic function rank()
, however it does not work for percent_rank()
in the correct manner. To demonstrate, I have the following dummy table:
with
table as (
select 'a' as category, 1 as num, 15 as num2, 7 as cutoff union all
select 'a' as category, 2 as num, 15 as num2, 7 as cutoff union all
select 'a' as category, 3 as num, 5 as num2, 7 as cutoff union all
select 'a' as category, 4 as num, 5 as num2, 7 as cutoff union all
select 'a' as category, 5 as num, 5 as num2, 7 as cutoff union all
select 'a' as category, 6 as num, 5 as num2, 7 as cutoff union all
select 'a' as category, 7 as num, 5 as num2, 7 as cutoff union all
select 'a' as category, 8 as num, 5 as num2, 7 as cutoff union all
select 'a' as category, 9 as num, 5 as num2, 7 as cutoff union all
select 'a' as category, 10 as num, 15 as num2, 7 as cutoff union all
select 'a' as category, 11 as num, 15 as num2, 7 as cutoff union all
select 'a' as category, 12 as num, 15 as num2, 7 as cutoff union all
select 'a' as category, 13 as num, 15 as num2, 7 as cutoff union all
select 'a' as category, 14 as num, 15 as num2, 7 as cutoff union all
select 'a' as category, 15 as num, 15 as num2, 7 as cutoff union all
select 'a' as category, 16 as num, 15 as num2, 7 as cutoff union all
select 'a' as category, 17 as num, 5 as num2, 7 as cutoff union all
select 'a' as category, 18 as num, 15 as num2, 7 as cutoff union all
select 'a' as category, 19 as num, 15 as num2, 7 as cutoff union all
select 'a' as category, 20 as num, 5 as num2, 7 as cutoff union all
select 'a' as category, 21 as num, 5 as num2, 7 as cutoff
)
percent_rank()
's are needed for the num
column. However, only rows where num2 > cutoff
should be considered for the percentile rankings. I have attempted the following 2 ways to compute percentiles, with the output:
select
*,
if(num2 >= cutoff,
percent_rank() over(
partition by category
order by num
), null) as pctile1,
if(num2 >= cutoff,
percent_rank() over(
partition by category
order by if (num2 >= cutoff, num, null) ASC
), null) as pctile2
from table
order by num asc
Both pctile1
and pctile2
are incorrect. To show why this is the case, look at row 10, which has pctile1 == 0.45
and pctile2 == 0.60
. However, amongst qualifying values, this should be a lower percentile. There are only 2 qualifying values lower than num == 10
(that being 1 and 2), whereas there are many values above 10 that qualify (11 - 19). The correct percentile for num == 10
, given the num2
and cutoff
values, should be closer to 30%, since 10
is the 3rd lowest value amongst the 11 qualifying values.
Note that I should not filter table to remove the rows I am not percent_rank()
'ing over, as I need to keep these rows.
I am not sure how to shrink image size but I am trying to do that currently.
Upvotes: 3
Views: 1611
Reputation: 173046
I would just simply go with below option
#standardSQL
SELECT *,
PERCENT_RANK() OVER(PARTITION BY category ORDER BY num) AS pctile
FROM table WHERE num2 >= cutoff
UNION ALL
SELECT *, NULL
FROM table WHERE num2 < cutoff
-- ORDER BY num
if to apply to sample data from your question - output is
Row category num num2 cutoff pctile
1 a 1 15 7 0.0
2 a 2 15 7 0.1
3 a 3 5 7 null
4 a 4 5 7 null
5 a 5 5 7 null
6 a 6 5 7 null
7 a 7 5 7 null
8 a 8 5 7 null
9 a 9 5 7 null
10 a 10 15 7 0.2
11 a 11 15 7 0.3
12 a 12 15 7 0.4
13 a 13 15 7 0.5
14 a 14 15 7 0.6
15 a 15 15 7 0.7
16 a 16 15 7 0.8
17 a 17 5 7 null
18 a 18 15 7 0.9
19 a 19 15 7 1.0
20 a 20 5 7 null
21 a 21 5 7 null
In my mind above is easy to read, but below is most likely want you wanted
SELECT *,
IF(num2 >= cutoff,
PERCENT_RANK() OVER(PARTITION BY IF(num2 >= cutoff, category, NULL) ORDER BY num),
NULL) AS pctile
FROM table
-- ORDER BY num
obviously, with the same result as above
Upvotes: 3