sshet
sshet

Reputation: 1160

how to calculate percentile in postgres

I Have table called timings where we are storing 1 million response timings for load testing , now we need to divide this data into 100 groups i.e. - first 500 records as one group and so on , and calculate percentile of each group , rather than average.

so far i tried this query

Select quartile
     , avg(data) 
     , max(data) 
  FROM (

        SELECT data
             , ntile(500) over (order by data) as quartile
          FROM data
       ) x
 GROUP BY quartile
 ORDER BY quartile

but how do i have find the percentile

Upvotes: 9

Views: 20854

Answers (2)

Paul A Jungwirth
Paul A Jungwirth

Reputation: 24541

Usually, if you want to know the percentile, you are safer using cume_dist than ntile. That is because ntile behaves strangely when given few inputs. Consider:

=# select v, 
          ntile(100) OVER (ORDER BY v),
          cume_dist() OVER (ORDER BY v)
   FROM (VALUES (1), (2), (4), (4)) x(v);

 v | ntile | cume_dist 
---+-------+-----------
 1 |     1 |      0.25
 2 |     2 |       0.5
 4 |     3 |         1
 4 |     4 |         1

You can see that ntile only uses the first 4 out of 100 buckets, where cume_dist always gives you a number from 0 to 1. So if you want to find out the 99th percentile, you can just throw away everything with a cume_dist under 0.99 and take the smallest v from what's left.

If you are on Postgres 9.4+, then percentile_cont and percentile_disc make it even easier, because you don't have to construct the buckets yourself. The former even gives you interpolation between values, which again may be useful if you have a small data set.

Upvotes: 13

khampson
khampson

Reputation: 15296

Edit:

Please note that since I originally answered this question, Postgres has gotten additional aggregate functions to help with this. See percentile_disc and percentile_cont here. These were introduced in 9.4.

Original Answer:

ntile is how one calculates percentiles (among other n-tiles, such as quartile, decile, etc.).

ntile groups the table into the specified number of buckets as equally as possible. If you specified 4 buckets, that would be a quartile. 10 would be a decile.

For percentile, you would set the number of buckets to be 100.

I'm not sure where the 500 comes in here... if you want to determine which percentile your data is in (i.e. divide the million timings as equally as possible into 100 buckets), you would use ntile with an argument of 100, and the groups would have more than 500 entries.

If you don't care about avg nor max, you can drop a bunch from your query. So it would look something like this:

SELECT data, ntile(100) over (order by data) AS percentile
FROM data
ORDER BY data

Upvotes: 10

Related Questions