Reputation: 1593
I have a table of transactions and would like to add a percentile column that specifies the percentile of that transaction in that month based on the amount column.
Take this smaller example with quartiles instead of percentiles:
Example Input:
id | month | amount
1 | 1 | 1
2 | 1 | 2
3 | 1 | 5
4 | 1 | 3
5 | 2 | 1
6 | 2 | 3
1 | 2 | 5
1 | 2 | 7
1 | 2 | 9
1 | 2 | 11
1 | 2 | 15
1 | 2 | 16
Example Output
id | month | amount | quartile
1 | 1 | 1 | 25
2 | 1 | 2 | 50
3 | 1 | 5 | 100
4 | 1 | 3 | 75
5 | 2 | 1 | 25
6 | 2 | 3 | 25
1 | 2 | 5 | 50
1 | 2 | 15 | 100
1 | 2 | 9 | 75
1 | 2 | 11 | 75
1 | 2 | 7 | 50
1 | 2 | 16 | 100
Currently, I use postgres's percentile_cont
function to determine the amount values of the cutoff points for different percentiles and then go through and do update the percentile column accordingly. Unfortunately, this approach is way too slow because I have many different months. Any ideas for how to do this more quickly, preferably combining the calculation of the percentile and the update in one SQL statement.
My code:
num_buckets = 10
for i in range(num_buckets):
decimal_percentile = (i+1)*(1.0/num_buckets)
prev_decimal_percentile = i*1.0/num_buckets
percentile = int(decimal_percentile*100)
cursor.execute("SELECT month,
percentile_cont(%s) WITHIN GROUP (ORDER BY amount ASC),
percentile_cont(%s) WITHIN GROUP (ORDER BY amount ASC)
FROM transactions GROUP BY month;",
(prev_decimal_percentile, decimal_percentile))
iter_cursor = connection.cursor()
for data in cursor:
iter_cursor.execute("UPDATE transactions SET percentile=%s
WHERE month = %s
AND amount >= %s AND amount <= %s;",
(percentile, data[0], data[1], data[2]))
Upvotes: 1
Views: 157
Reputation: 121754
You can do this in a single query, example for 4 buckets:
update transactions t
set percentile = calc_percentile
from (
select distinct on (month, amount)
id,
month,
amount,
calc_percentile
from transactions
join (
select
bucket,
month as calc_month,
percentile_cont(bucket*1.0/4) within group (order by amount asc) as calc_amount,
bucket*100/4 as calc_percentile
from transactions
cross join generate_series(1, 4) bucket
group by month, bucket
) s on month = calc_month and amount <= calc_amount
order by month, amount, calc_percentile
) s
where t.month = s.month and t.amount = s.amount;
Results:
select *
from transactions
order by month, amount;
id | month | amount | percentile
----+-------+--------+------------
1 | 1 | 1 | 25
2 | 1 | 2 | 50
4 | 1 | 3 | 75
3 | 1 | 5 | 100
5 | 2 | 1 | 25
6 | 2 | 3 | 25
1 | 2 | 5 | 50
1 | 2 | 7 | 50
1 | 2 | 9 | 75
1 | 2 | 11 | 75
1 | 2 | 15 | 100
1 | 2 | 16 | 100
(12 rows)
Btw, id
should be a primary key, then it could be used in joins for better performance.
Upvotes: 1