Canovice
Canovice

Reputation: 10441

percent_rank() in BigQuery with conditional to include only certain rows

I previously posted this related question. The solution to this question works for the analytic function rank(), however it does not work for percent_rank() in the correct manner. To demonstrate, I have the following dummy table:

with 
  table as (
    select 'a' as category, 1 as num, 15 as num2, 7 as cutoff union all 
    select 'a' as category, 2 as num, 15 as num2, 7 as cutoff union all
    select 'a' as category, 3 as num, 5 as num2, 7 as cutoff union all
    select 'a' as category, 4 as num, 5 as num2, 7 as cutoff union all
    select 'a' as category, 5 as num, 5 as num2, 7 as cutoff union all
    select 'a' as category, 6 as num, 5 as num2, 7 as cutoff union all
    select 'a' as category, 7 as num, 5 as num2, 7 as cutoff union all
    select 'a' as category, 8 as num, 5 as num2, 7 as cutoff union all
    select 'a' as category, 9 as num, 5 as num2, 7 as cutoff union all
    select 'a' as category, 10 as num, 15 as num2, 7 as cutoff union all
    select 'a' as category, 11 as num, 15 as num2, 7 as cutoff union all
    select 'a' as category, 12 as num, 15 as num2, 7 as cutoff union all
    select 'a' as category, 13 as num, 15 as num2, 7 as cutoff union all
    select 'a' as category, 14 as num, 15 as num2, 7 as cutoff union all
    select 'a' as category, 15 as num, 15 as num2, 7 as cutoff union all
    select 'a' as category, 16 as num, 15 as num2, 7 as cutoff union all
    select 'a' as category, 17 as num, 5 as num2, 7 as cutoff union all
    select 'a' as category, 18 as num, 15 as num2, 7 as cutoff union all
    select 'a' as category, 19 as num, 15 as num2, 7 as cutoff union all
    select 'a' as category, 20 as num, 5 as num2, 7 as cutoff union all
    select 'a' as category, 21 as num, 5 as num2, 7 as cutoff 
  )

percent_rank()'s are needed for the num column. However, only rows where num2 > cutoff should be considered for the percentile rankings. I have attempted the following 2 ways to compute percentiles, with the output:

select
  *,
  if(num2 >= cutoff,
      percent_rank() over(
        partition by category
        order by num
      ), null) as pctile1,
  if(num2 >= cutoff,
      percent_rank() over(
        partition by category
        order by if (num2 >= cutoff, num, null) ASC
      ), null) as pctile2
from table
order by num asc

enter image description here

Both pctile1 and pctile2 are incorrect. To show why this is the case, look at row 10, which has pctile1 == 0.45 and pctile2 == 0.60. However, amongst qualifying values, this should be a lower percentile. There are only 2 qualifying values lower than num == 10 (that being 1 and 2), whereas there are many values above 10 that qualify (11 - 19). The correct percentile for num == 10, given the num2 and cutoff values, should be closer to 30%, since 10 is the 3rd lowest value amongst the 11 qualifying values.

Note that I should not filter table to remove the rows I am not percent_rank()'ing over, as I need to keep these rows.

Edit

I am not sure how to shrink image size but I am trying to do that currently.

Upvotes: 3

Views: 1611

Answers (1)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173046

I would just simply go with below option

#standardSQL
SELECT *,
  PERCENT_RANK() OVER(PARTITION BY category ORDER BY num) AS pctile
FROM table WHERE num2 >= cutoff
UNION ALL
SELECT *, NULL
FROM table WHERE num2 < cutoff
-- ORDER BY num

if to apply to sample data from your question - output is

Row category    num num2    cutoff  pctile  
1   a           1   15      7       0.0     
2   a           2   15      7       0.1     
3   a           3   5       7       null    
4   a           4   5       7       null    
5   a           5   5       7       null    
6   a           6   5       7       null    
7   a           7   5       7       null    
8   a           8   5       7       null    
9   a           9   5       7       null    
10  a           10  15      7       0.2     
11  a           11  15      7       0.3     
12  a           12  15      7       0.4     
13  a           13  15      7       0.5     
14  a           14  15      7       0.6     
15  a           15  15      7       0.7     
16  a           16  15      7       0.8     
17  a           17  5       7       null    
18  a           18  15      7       0.9     
19  a           19  15      7       1.0     
20  a           20  5       7       null    
21  a           21  5       7       null      

In my mind above is easy to read, but below is most likely want you wanted

SELECT *,
  IF(num2 >= cutoff, 
    PERCENT_RANK() OVER(PARTITION BY IF(num2 >= cutoff, category, NULL) ORDER BY num), 
    NULL) AS pctile
FROM table
-- ORDER BY num

obviously, with the same result as above

Upvotes: 3

Related Questions