Reputation: 8903
Having the following schema of model's predictions over time per user:
timestamp | user_id | model_id | version | prediction |
---|---|---|---|---|
2022-06-22 05:29:36.344 | 1 | model_a | 1 | [0.1226] |
2022-06-22 05:29:41.307 | 1 | model_a | 1 | [0.932] |
... | 1 | model_a | 1 | ... |
2022-06-22 05:29:43.511 | 2 | model_a | 1 | [0.0226] |
2022-06-22 05:29:43.870 | 2 | model_a | 1 | [0.132] |
... | 2 | model_a | 1 | ... |
I would like to select the last prediction per user (by timestamp), then take the result vector and calculate the boundary values of 100 percentiles. I have come out with the following query that get the job done:
with preds as
(select user_id, last_prediction
from
(select user_id, round(prediction[1],5) last_prediction, timestamp curr_ts, max(timestamp) over (partition by user_id) max_ts
FROM "my-schema"."my-table"
where date(timestamp) BETWEEN date('2022-08-08') AND date('2022-08-09')
AND version = '1' AND model_id = 'model_a')
where curr_ts = max_ts),
with_ntiles as
(select *,NTILE(100) OVER(ORDER BY last_prediction) calculated_ntile
from preds)
select calculated_ntile, min(last_prediction) min_pred, max(last_prediction) max_pred
from with_ntiles
group by 1 order by 1
Result:
# | calculated_ntile | min_pred | max_pred |
---|---|---|---|
1 | 1 | 0.00172 | 0.00261 |
2 | 2 | 0.00263 | 0.00305 |
3 | 3 | 0.00305 | 0.00345 |
... | ... | ... | ... |
I'm looking for a more elegant and maybe a faster solution.
Upvotes: 2
Views: 175
Reputation: 141665
One thing you can try is to change from windows function to ordinary group by with max_by
aggregate for the first CTE:
-- sample data
WITH dataset (timestamp, user_id, model_id, version, prediction) AS (
VALUES (timestamp '2022-06-22 05:29:36.344', 1, 'model_a', '1', array[0.1226]),
(timestamp '2022-06-22 05:29:41.307', 1, 'model_a', '1', array[0.932]),
(timestamp '2022-06-22 05:29:43.511', 2, 'model_a', '1', array[0.0226]),
(timestamp '2022-06-22 05:29:43.870', 2, 'model_a', '1', array[0.132])
)
-- sample query
select user_id,
max_by(round(prediction[1],5), timestamp) last_prediction
FROM dataset
where date(timestamp) BETWEEN date('2022-06-21') AND date('2022-08-09')
AND version = '1' AND model_id = 'model_a'
GROUP BY user_id;
Output:
user_id | last_prediction |
---|---|
1 | 0.9320 |
2 | 0.1320 |
Upvotes: 2