yankee
yankee

Reputation: 40850

How to get the top 5 elements from different columns with a single query?

Problem

I have the following table in a PostgreSQL 9.6 DB:

 factorA | factorB | factorC | result 
---------+---------+---------+-------
       1 |      5  |     100 |     1
       1 |      6  |     200 |     3
       2 |      6  |      60 |     6
       2 |      5  |      70 |    10

The table is of course much larger with many more factors that follow.

Now I have two questions:

  1. How much did the top 5 factorA-values contribute to result
  2. How much did the top 5 factorB-value contribute to result

(The factorC value is only there to show that there are other columns which however are not of interest for this question)

To answer question 1 we see that sum(value) is 20 and sum(value where factorA = 1) is 4 and sum(value where factorA = 2) is 16. So we can calculate the factorA=1 contributed 4/20=20% and factorB=2 contributed 16/20=80%. The same principle applies to factorB. So the desired result is something I can read those values from:

{"factorA": {1: 0.2, 2: 0.8, ...}, "factorB": {5: 0.55, 6: 0.45, ...}}

Answering each of those two questions individually in SQL is no problem:

select * from (
    select
        factorA,
        row_number() over (partition by 't' order by sum(result) desc) rank,
        sum(result) / sum(sum(result)) over (partition by 't') contribution
    from (values(1,5,100,1), (1,6,200,3), (2,6,60,6), (2,5,70,10)) x(factorA, factorB, factorC, result)
    group by factorA
) x where rank <= 5

This results in:

 factorA | rank |      contribution
---------+------+-----------------------
       2 |    1 | 0.80000000000000000000
       1 |    2 | 0.20000000000000000000

Replacing all "factorA" with "factorB" will answer the second question. However the problem is that in reality the table does not only contain 4 values, but actually hundred thousands of values, but not very many factors. So to run the query a full table scan is almost inevitable. Thus the query will be slow. Running it twice, once for factorA and once for factorB causes two table scans. I would like to avoid that and answer both questions with a single query.

What I have tried

My best try so far is this:

select *
from
(
    select
        x.*,
        dense_rank() over (order by contributionA desc) rankA,
        dense_rank() over (order by contributionB desc) rankB
    from (
        select
            factorA,
            factorB,
            sum(sum(result)) over (partition by factorA) / sum(sum(result)) over (partition by 't')::float contributionA,
            sum(sum(result)) over (partition by factorB) / sum(sum(result)) over (partition by 't')::float contributionB
        from (values(1,5,100,1), (1,6,200,3), (2,6,60,6), (2,5,70,10)) x(factorA, factorB, factorC, result)
        group by factorA, factorB
    ) x
) x
where rankA <= 5 and rankB <= 5

This results in:

 factora | factorb | contributiona | contributionb | ranka | rankb
---------+---------+---------------+---------------+-------+-------
       2 |       5 |           0.8 |          0.55 |     1 |     1
       1 |       5 |           0.2 |          0.55 |     2 |     1
       2 |       6 |           0.8 |          0.45 |     1 |     2
       1 |       6 |           0.2 |          0.45 |     2 |     2

And that is a reasonable result except that the filtering for "top 5" is broken, because if two contributions are actually identical, then they share the same rank and thus the result will have 6 or more results. As a demonstration here is the query with only the "top 1", but it still gives the top 2:

select *
from
(
    select
        x.*,
        dense_rank() over (order by contributionA desc) rankA,
        dense_rank() over (order by contributionB desc) rankB
    from (
        select
            factorA,
            factorB,
            sum(sum(result)) over (partition by factorA) / sum(sum(result)) over (partition by 't')::float contributionA,
            sum(sum(result)) over (partition by factorB) / sum(sum(result)) over (partition by 't')::float contributionB
        from (values(1,5,100,1), (1,6,200,1), (2,6,60,1), (2,5,70,1)) x(factorA, factorB, factorC, result)
        group by factorA, factorB
    ) x
) x
where rankA <= 1 and rankB <= 1

Upvotes: 1

Views: 61

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1270873

I think you want to aggregate after calculating the ranks, not before:

select sum(case when rankA <= 5 then cumeFactorA end) / sum(result::float),
       sum(case when rankB <= 5 then cumeFactorB end) / sum(result::float)      
from (select x.*,
             dense_rank() over (order by factorA desc) as rankA,
             dense_rank() over (order by factorB desc) as rankB,
             sum(result) over (order by factorA) as cumeFactorA,
             sum(result) over (order by factorB) as cumeFactorB,
             sum(result) over () as total_result
      from (values(1,5,100,1), (1,6,200,1), (2,6,60,1), (2,5,70,1)) x(factorA, factorB, factorC, result)
    ) x
where rankA <= 5 or rankB <= 5;

Upvotes: 1

Related Questions