Gargoyle
Gargoyle

Reputation: 10324

Select count of max items, get rank and percentile

I have a table with multiple entries per person_id column. I'm storing a score, a category_id, and a created column as well. So every time the person completes the indicated category I store a single record for them.

Now I'm trying to write a query that says: Just using the most recently created score for each person, find out how many people scored worse than I did for a specific category. I'm basically doing a percentile calculation here. So to get the total number scores I'm doing:

select count(distinct person_id) from performances where category_id = 7;

I'm not sure how to write the second query though, that finds out how many people did worse than me. Is this somewhere I'd use that "OVER PARTITION" type windowing function?

Upvotes: 0

Views: 1779

Answers (2)

Erwin Brandstetter
Erwin Brandstetter

Reputation: 656804

What you actually asked

Just using the most recently created score for each person

... translates to:

SELECT DISTINCT ON (person_id) *
FROM   performances
ORDER  BY person_id, created DESC;

Do not add a WHERE condition here (yet) or you get different (incorrect) results. Details for DISTINCT ON:

find out how many people scored worse than I did for a specific category.

... translates to:

SELECT *
     , dense_rank() OVER w AS worse_than_me
     , ntile(100)   OVER w AS percentile
FROM  (
   SELECT DISTINCT ON (person_id) *
   FROM   performances
   ORDER  BY person_id, created DESC
  ) p
WINDOW w AS (PARTITION BY category_id ORDER BY score);

Assuming "worse" means a lower score.
The window function dense_rank() is the right tool that answers the question "How many people?" - as opposed to rank() which answers "How many distinct scores?".

ntile(100) over the same window definition gives you the ready percentile as integer, 100 meaning in the top 1 %, 99 meaning in the 2nd best % etc.

However, ntile() returns, per documentation:

integer ranging from 1 to the argument value, dividing the partition as equally as possible

That means, if you should have less than 100 rows in your partition (like you commented), multiple by 100.0 / count(*) to scale the number. A "percentile" is not the most useful statistic for just a hand full of rows in a set, it's typically used on big sets.

What you did not ask I

But quite possibly meant to ask:

"How does each person rank in the category (s)he finished last among all other results in that category?"

Assuming unique entries for (person_id, category_id), or you also have to define how to deal with multiple results per person in the same category (including self).

SELECT *
FROM  (
   SELECT DISTINCT ON (person_id) *
   FROM   performances
   ORDER  BY person_id, created DESC
   ) pers
JOIN (
   SELECT person_id, category_id
        , dense_rank() OVER w AS worse_than_me
        , ntile(100)   OVER w AS percentile
   FROM   performances
   WINDOW w AS (PARTITION BY category_id ORDER BY score)
   ) rnk f USING (person_id, category_id);
  • In the subquery pers we distill the last entry per person (the one of interest).
  • In the subquery rnk we get ranking and percentile compared to all other entries.
  • JOIN with the USING clause, and you got a ready SELECT list without duplicate columns.

What you did not ask II

but would also make more sense if there can be multiple entries per (person_id, category_id):

"Get the rank for the latest score of each person in each category compared to all other latest personal scores in the same category."

SELECT *
     , dense_rank() OVER w AS worse_than_me
     , ntile(100)   OVER w AS percentile
FROM  (
   SELECT DISTINCT ON (person_id, category_id) *
   FROM   performances
   ORDER  BY person_id, category_id, created DESC;
  ) p
WINDOW w AS (PARTITION BY category_id ORDER BY score);

Unclear / ambiguous questions lead to arbitrary results. The first step to a solution is to define the task clearly.

Upvotes: 3

Clodoaldo Neto
Clodoaldo Neto

Reputation: 125254

select
    person_id,
    count(*) over() as total_person,
    rank() over(order by score desc) as score_rank
from (
    select distinct on (person_id) *
    from score
    where category_id = 7
    order by person_id, created desc
) s

Check rank, dense_rank, percent_rank, ntile, and cume_dist:

http://www.postgresql.org/docs/current/static/functions-window.html

distinct on returns a single row from each of the person_ids. Using the order by clause it is possible to choose each one.

Upvotes: 1

Related Questions