Athena get percentiles boundaries

Question

Having the following schema of model's predictions over time per user:

timestamp	user_id	model_id	version	prediction
2022-06-22 05:29:36.344	1	model_a	1	[0.1226]
2022-06-22 05:29:41.307	1	model_a	1	[0.932]
...	1	model_a	1	...
2022-06-22 05:29:43.511	2	model_a	1	[0.0226]
2022-06-22 05:29:43.870	2	model_a	1	[0.132]
...	2	model_a	1	...

I would like to select the last prediction per user (by timestamp), then take the result vector and calculate the boundary values of 100 percentiles. I have come out with the following query that get the job done:

with preds as
(select user_id, last_prediction
from
    (select user_id, round(prediction[1],5) last_prediction, timestamp curr_ts, max(timestamp) over (partition by user_id) max_ts
    FROM "my-schema"."my-table"
    where date(timestamp) BETWEEN date('2022-08-08') AND date('2022-08-09')
    AND version = '1' AND model_id = 'model_a')
where curr_ts = max_ts),

with_ntiles as
    (select *,NTILE(100) OVER(ORDER BY last_prediction) calculated_ntile
    from preds)

select calculated_ntile, min(last_prediction) min_pred, max(last_prediction) max_pred
from with_ntiles
group by 1 order by 1

Result:

#	calculated_ntile	min_pred	max_pred
1	1	0.00172	0.00261
2	2	0.00263	0.00305
3	3	0.00305	0.00345
...	...	...	...

I'm looking for a more elegant and maybe a faster solution.

Athena get percentiles boundaries

Answers (1)

Related Questions

user_id	last_prediction
1	0.9320
2	0.1320