Reputation: 40850
I have the following table in a PostgreSQL 9.6 DB:
factorA | factorB | factorC | result
---------+---------+---------+-------
1 | 5 | 100 | 1
1 | 6 | 200 | 3
2 | 6 | 60 | 6
2 | 5 | 70 | 10
The table is of course much larger with many more factors that follow.
Now I have two questions:
(The factorC value is only there to show that there are other columns which however are not of interest for this question)
To answer question 1 we see that sum(value)
is 20 and sum(value where factorA = 1)
is 4 and sum(value where factorA = 2)
is 16. So we can calculate the factorA=1 contributed 4/20=20% and factorB=2 contributed 16/20=80%. The same principle applies to factorB. So the desired result is something I can read those values from:
{"factorA": {1: 0.2, 2: 0.8, ...}, "factorB": {5: 0.55, 6: 0.45, ...}}
Answering each of those two questions individually in SQL is no problem:
select * from (
select
factorA,
row_number() over (partition by 't' order by sum(result) desc) rank,
sum(result) / sum(sum(result)) over (partition by 't') contribution
from (values(1,5,100,1), (1,6,200,3), (2,6,60,6), (2,5,70,10)) x(factorA, factorB, factorC, result)
group by factorA
) x where rank <= 5
This results in:
factorA | rank | contribution
---------+------+-----------------------
2 | 1 | 0.80000000000000000000
1 | 2 | 0.20000000000000000000
Replacing all "factorA" with "factorB" will answer the second question. However the problem is that in reality the table does not only contain 4 values, but actually hundred thousands of values, but not very many factors. So to run the query a full table scan is almost inevitable. Thus the query will be slow. Running it twice, once for factorA and once for factorB causes two table scans. I would like to avoid that and answer both questions with a single query.
My best try so far is this:
select *
from
(
select
x.*,
dense_rank() over (order by contributionA desc) rankA,
dense_rank() over (order by contributionB desc) rankB
from (
select
factorA,
factorB,
sum(sum(result)) over (partition by factorA) / sum(sum(result)) over (partition by 't')::float contributionA,
sum(sum(result)) over (partition by factorB) / sum(sum(result)) over (partition by 't')::float contributionB
from (values(1,5,100,1), (1,6,200,3), (2,6,60,6), (2,5,70,10)) x(factorA, factorB, factorC, result)
group by factorA, factorB
) x
) x
where rankA <= 5 and rankB <= 5
This results in:
factora | factorb | contributiona | contributionb | ranka | rankb
---------+---------+---------------+---------------+-------+-------
2 | 5 | 0.8 | 0.55 | 1 | 1
1 | 5 | 0.2 | 0.55 | 2 | 1
2 | 6 | 0.8 | 0.45 | 1 | 2
1 | 6 | 0.2 | 0.45 | 2 | 2
And that is a reasonable result except that the filtering for "top 5" is broken, because if two contributions are actually identical, then they share the same rank and thus the result will have 6 or more results. As a demonstration here is the query with only the "top 1", but it still gives the top 2:
select *
from
(
select
x.*,
dense_rank() over (order by contributionA desc) rankA,
dense_rank() over (order by contributionB desc) rankB
from (
select
factorA,
factorB,
sum(sum(result)) over (partition by factorA) / sum(sum(result)) over (partition by 't')::float contributionA,
sum(sum(result)) over (partition by factorB) / sum(sum(result)) over (partition by 't')::float contributionB
from (values(1,5,100,1), (1,6,200,1), (2,6,60,1), (2,5,70,1)) x(factorA, factorB, factorC, result)
group by factorA, factorB
) x
) x
where rankA <= 1 and rankB <= 1
Upvotes: 1
Views: 61
Reputation: 1270873
I think you want to aggregate after calculating the ranks, not before:
select sum(case when rankA <= 5 then cumeFactorA end) / sum(result::float),
sum(case when rankB <= 5 then cumeFactorB end) / sum(result::float)
from (select x.*,
dense_rank() over (order by factorA desc) as rankA,
dense_rank() over (order by factorB desc) as rankB,
sum(result) over (order by factorA) as cumeFactorA,
sum(result) over (order by factorB) as cumeFactorB,
sum(result) over () as total_result
from (values(1,5,100,1), (1,6,200,1), (2,6,60,1), (2,5,70,1)) x(factorA, factorB, factorC, result)
) x
where rankA <= 5 or rankB <= 5;
Upvotes: 1