Python Postgres Set Column to be Percentile

Question

I have a table of transactions and would like to add a percentile column that specifies the percentile of that transaction in that month based on the amount column.

Take this smaller example with quartiles instead of percentiles:

Example Input:

id | month | amount
1  |   1   |   1
2  |   1   |   2
3  |   1   |   5
4  |   1   |   3
5  |   2   |   1
6  |   2   |   3
1  |   2   |   5
1  |   2   |   7
1  |   2   |   9
1  |   2   |   11
1  |   2   |   15
1  |   2   |   16

Example Output

id | month | amount |  quartile
1  |   1   |   1    |      25
2  |   1   |   2    |      50
3  |   1   |   5    |      100
4  |   1   |   3    |      75
5  |   2   |   1    |      25
6  |   2   |   3    |      25
1  |   2   |   5    |      50
1  |   2   |   15   |      100
1  |   2   |   9    |      75
1  |   2   |   11   |      75
1  |   2   |   7    |      50
1  |   2   |   16   |      100

Currently, I use postgres's percentile_cont function to determine the amount values of the cutoff points for different percentiles and then go through and do update the percentile column accordingly. Unfortunately, this approach is way too slow because I have many different months. Any ideas for how to do this more quickly, preferably combining the calculation of the percentile and the update in one SQL statement.

My code:

num_buckets = 10

for i in range(num_buckets):
    decimal_percentile = (i+1)*(1.0/num_buckets)
    prev_decimal_percentile = i*1.0/num_buckets
    percentile = int(decimal_percentile*100)
    cursor.execute("SELECT month, 
                           percentile_cont(%s) WITHIN GROUP (ORDER BY amount ASC), 
                           percentile_cont(%s) WITHIN GROUP (ORDER BY amount ASC) 
                     FROM transactions GROUP BY month;", 
                     (prev_decimal_percentile, decimal_percentile))
    iter_cursor = connection.cursor()
    for data in cursor:
        iter_cursor.execute("UPDATE transactions SET percentile=%s 
                             WHERE month = %s 
                                   AND amount >= %s AND amount <= %s;", 
                            (percentile, data[0], data[1], data[2]))

klin · Accepted Answer

You can do this in a single query, example for 4 buckets:

update transactions t
set percentile = calc_percentile
from (
    select distinct on (month, amount) 
        id, 
        month, 
        amount, 
        calc_percentile
    from transactions
    join (
        select 
            bucket,
            month as calc_month, 
            percentile_cont(bucket*1.0/4) within group (order by amount asc) as calc_amount,
            bucket*100/4 as calc_percentile
        from transactions 
        cross join generate_series(1, 4) bucket
        group by month, bucket
        ) s on month = calc_month and amount <= calc_amount
    order by month, amount, calc_percentile 
    ) s
where t.month = s.month and t.amount = s.amount;

Results:

select *
from transactions
order by month, amount;

 id | month | amount | percentile 
----+-------+--------+------------
  1 |     1 |      1 |         25
  2 |     1 |      2 |         50
  4 |     1 |      3 |         75
  3 |     1 |      5 |        100
  5 |     2 |      1 |         25
  6 |     2 |      3 |         25
  1 |     2 |      5 |         50
  1 |     2 |      7 |         50
  1 |     2 |      9 |         75
  1 |     2 |     11 |         75
  1 |     2 |     15 |        100
  1 |     2 |     16 |        100
(12 rows)

Btw, id should be a primary key, then it could be used in joins for better performance.

DbFiddle.

Python Postgres Set Column to be Percentile

Answers (1)

Related Questions