Shlomi Schwartz
Shlomi Schwartz

Reputation: 8903

Athena get percentiles boundaries

Having the following schema of model's predictions over time per user:

timestamp user_id model_id version prediction
2022-06-22 05:29:36.344 1 model_a 1 [0.1226]
2022-06-22 05:29:41.307 1 model_a 1 [0.932]
... 1 model_a 1 ...
2022-06-22 05:29:43.511 2 model_a 1 [0.0226]
2022-06-22 05:29:43.870 2 model_a 1 [0.132]
... 2 model_a 1 ...

I would like to select the last prediction per user (by timestamp), then take the result vector and calculate the boundary values of 100 percentiles. I have come out with the following query that get the job done:

with preds as
(select user_id, last_prediction
from
    (select user_id, round(prediction[1],5) last_prediction, timestamp curr_ts, max(timestamp) over (partition by user_id) max_ts
    FROM "my-schema"."my-table"
    where date(timestamp) BETWEEN date('2022-08-08') AND date('2022-08-09')
    AND version = '1' AND model_id = 'model_a')
where curr_ts = max_ts),

with_ntiles as
    (select *,NTILE(100) OVER(ORDER BY last_prediction) calculated_ntile
    from preds)

select calculated_ntile, min(last_prediction) min_pred, max(last_prediction) max_pred
from with_ntiles
group by 1 order by 1 

Result:

# calculated_ntile min_pred max_pred
1 1 0.00172 0.00261
2 2 0.00263 0.00305
3 3 0.00305 0.00345
... ... ... ...

I'm looking for a more elegant and maybe a faster solution.

Upvotes: 2

Views: 175

Answers (1)

Guru Stron
Guru Stron

Reputation: 141665

One thing you can try is to change from windows function to ordinary group by with max_by aggregate for the first CTE:

-- sample data
WITH dataset (timestamp, user_id, model_id, version, prediction) AS (
    VALUES (timestamp '2022-06-22 05:29:36.344', 1, 'model_a', '1', array[0.1226]),
        (timestamp '2022-06-22 05:29:41.307', 1, 'model_a', '1', array[0.932]),
        (timestamp '2022-06-22 05:29:43.511', 2, 'model_a', '1', array[0.0226]),
        (timestamp '2022-06-22 05:29:43.870', 2, 'model_a', '1', array[0.132])
)

-- sample query
select user_id, 
    max_by(round(prediction[1],5), timestamp) last_prediction
FROM dataset
where date(timestamp) BETWEEN date('2022-06-21') AND date('2022-08-09')
    AND version = '1' AND model_id = 'model_a'
GROUP BY user_id;

Output:

user_id last_prediction
1 0.9320
2 0.1320

Upvotes: 2

Related Questions