William
William

Reputation: 2191

Percentile aggregate for SQL Server 2008 R2

I'm using SQL Server 2008 R2. I need to compute a percentile value per group, something like:

SELECT id,
       PCTL(0.9, x) -- for the 90th percentile
FROM my_table
GROUP BY id
ORDER BY id

For example, given this DDL (fiddle) ---

CREATE TABLE my_table (id INT, x REAL);

INSERT INTO my_table
VALUES (7, 0.164595), (5, 0.671311), (7, 0.0118385), (6, 0.704592), (3, 0.633521), (3, 0.337268), (0, 0.54739), (6, 0.312282), (0, 0.220618), (7, 0.214973), (6, 0.410768), (7, 0.151572), (7, 0.0639506), (5, 0.339075), (1, 0.284094), (2, 0.126722), (2, 0.870079), (3, 0.369366), (1, 0.6687), (5, 0.199456), (5, 0.0296715), (1, 0.330339), (9, 0.0000459612), (5, 0.391947), (3, 0.753965), (8, 0.334207), (7, 0.583357), (3, 0.326951), (4, 0.207057), (2, 0.258463), (2, 0.0532811), (1, 0.751584), (7, 0.592624), (7, 0.673506), (5, 0.44764), (6, 0.733737), (5, 0.141215), (7, 0.222452), (3, 0.597019), (1, 0.293901), (4, 0.516213), (7, 0.498336), (6, 0.410461), (2, 0.32211), (1, 0.466735), (5, 0.720456), (8, 0.000428383), (3, 0.46085), (0, 0.402963), (7, 0.677002), (0, 0.400122), (1, 0.762357), (9, 0.158455), (7, 0.359723), (4, 0.225914), (7, 0.795345), (6, 0.902261), (2, 0.69533), (8, 0.593605), (6, 0.266233), (0, 0.917188), (9, 0.96353), (2, 0.577035), (8, 0.945236), (3, 0.257776), (4, 0.560569), (0, 0.838326), (2, 0.660338), (2, 0.537372), (8, 0.33806), (0, 0.545107), (1, 0.616673), (5, 0.30411), (0, 0.434737), (2, 0.588249), (9, 0.991362), (8, 0.772253), (6, 0.705396), (5, 0.323255), (8, 0.830319), (3, 0.679546), (4, 0.399748), (4, 0.440115), (6, 0.938154), (8, 0.333143), (9, 0.923541), (7, 0.19552), (4, 0.869822), (7, 0.620006), (4, 0.833529), (4, 0.297515), (4, 0.19906), (5, 0.540905), (9, 0.33313), (5, 0.200515), (5, 0.900481), (6, 0.02665), (3, 0.495421), (0, 0.96582), (9, 0.847218);

--- I want approximately (within variation of common percentile methods) the following:

id  x
----------
0   0.9658
1   0.7624
2   0.6953
3   0.6795
4   0.8335
5   0.7205
6   0.9023
7   0.677
8   0.9452
9   0.9914

The actual input set has about two million rows, and each actual id group has a few dozen to a few hundred (or possibly more) rows.

I've explored SO and other sites for solutions, but it seems like the couple dozen or so pages I checked have solutions that are only applicable to computing a percentile over an entire row set rather than each group/partition of a row set. (I'm relatively inexperienced with SQL, so I might have overlooked something.)

I've also looked at the docs for the ranking functions, but I haven't been able to glue together a query that would work.

I'd like to use PERCENTILE_DISC or PERCENTILE_CONT, but I'm stuck with 2008 R2 for now.

Upvotes: 0

Views: 8899

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1269503

I like to do these calculations directly, using row_number()/rank() and window functions. The built-in functions are useful, but they don't actually save that much effort:

SELECT id,
       MIN(CASE WHEN seqnum >= 0.9 * cnt THEN x END) as percentile_90
FROM (select t.*,
             row_number() over (partition by id order by x) as seqnum,
             count(*) over (partition by id) as cnt
      from my_table t
     ) t
GROUP BY id
ORDER BY id;

This takes the first value that is at the 90th percentile or greater. There are variations on this that can do the continuous version (take the largest value less than or equal to and the smallest one bigger than and interpolate).

Upvotes: 6

Related Questions