dicristina
dicristina

Reputation: 425

Can Window Function be used with Grouping Sets?

Introduction

In this Q&A we will explore the use of window functions and grouping sets in the same query using PostgreSQL. The reader should be familiar with both of those concepts to get the most out of this post. This fiddle can be used to follow along.

Data

The following table contains a few transactions. For each transaction we have the customer, the city where it took place and the number of units delivered to the customer.

     CREATE TABLE transactions (customer TEXT, city TEXT, units INT);
     INSERT INTO transactions VALUES ('Bob', 'Manchester', 10),
                                     ('Chuck', 'Manchester', 20),
                                     ('Bob', 'Concord', 10),
                                     ('Tim', 'Manchester', 15),
                                     ('Jane', 'Derry', 10),
                                     ('Tim', 'Derry', 15),
                                     ('Tim', 'Concord', 20),
                                     ('Bob', 'Manchester', 20),
                                     ('Chuck', 'Concord', 10);

Desired results

I want to be able to produce a report which contains the answers to questions like "what proportion of all transactions is represented by this row" or "what is the ratio of this to all the transactions by this customer". Also I want to be able to answer all such possible questions. With that kind of a report we could simply look at each row and extract information about its relation to the whole report or to specific slices of the data.

First attempt

We could attempt to create such a report by using multiple queries with different GROUP BY clauses and then uniting them with UNION ALL or we can try something like the following:

-- Incorrect results 
  SELECT customer
       , city
       , SUM(units) AS "units"
       , ROUND( 100 * SUM(units) / SUM(SUM(units)) OVER () , 2) AS "to report"
       , ROUND( 100 * SUM(units) / SUM(SUM(units)) OVER (PARTITION BY customer) , 2) AS "to customer"
       , ROUND( 100 * SUM(units) / SUM(SUM(units)) OVER (PARTITION BY city) , 2) AS "to city"
    FROM transactions
GROUP BY CUBE(customer, city)

Here we use CUBE in the GROUP BY clause which will produce groupings corresponding to all possible combinations of the customer and city columns. The numerator of the ratios is an aggregate that corresponds to the units total for that row. Notice that it is the same for all ratios and that it contains the expression used as the "units" column, i.e. SUM(units). Calculating the denominator is more complicated because we need a window function to calculate the total number of units for the wanted slice, i.e. the total for a particular customer or city or the total for the whole report.

Incorrect Results

Unfortunately the ratios produced by the query above are not correct. For example the first row has 10 units which is 7.7% of the total (130), 25% of the total for Bob (40), and 25% of the total for Concord (40) yet the results show less than the correct ratio in all cases. As another example take the row where both "customer" and "city" are NULL, here the "unit" column is 130 and yet the calculated ratio "to report" is 25%. Clearly the denominator in the ratio columns is wrong. How can we get the desired results?

customer city units to report to customer to city
Bob Concord 10 1.92 12.50 12.50
Bob Manchester 30 5.77 37.50 23.08
Bob null 40 7.69 50.00 15.38
Chuck Concord 10 1.92 16.67 12.50
Chuck Manchester 20 3.85 33.33 15.38
Chuck null 30 5.77 50.00 11.54
Jane Derry 10 1.92 50.00 20.00
Jane null 10 1.92 50.00 3.85
Tim Concord 20 3.85 20.00 25.00
Tim Derry 15 2.88 15.00 30.00
Tim null 50 9.62 50.00 19.23
Tim Manchester 15 2.88 15.00 11.54
null null 130 25.00 50.00 50.00
null Manchester 65 12.50 25.00 50.00
null Derry 25 4.81 9.62 50.00
null Concord 40 7.69 15.38 50.00

Upvotes: 2

Views: 1138

Answers (1)

dicristina
dicristina

Reputation: 425

Why are the results wrong?

Notice that although the results in the question are wrong they are not totally nonsensical. Take, for example, the row in which both "customer" and "city" are NULL. This row has 130 units which is the total number of units in the data, so we should expect the ratio "to report" to be 100% but the result shows 25%, which means that the denominator of the ratio in that case was four times 130, or 520. Take the first row as another example, here we have 10 units and a ratio "to report" of 1.92%, again the denominator is wrong by a factor of four, i.e. the actual ratio should be 7.69%. Clearly the total of the report is taken to be four times what it actually is.

The results for the "to customer" and "to city" columns are wrong as well but by a different factor. Take for example the rows where "customer" is NULL and "city" is not NULL. The ratio "to city" for the three cities is 50% but should be 100%. This is because the denominator of the ratio is twice what it should be. The same thing happens for the "to customer" ratio. The problem lies in the partitioning of the rows. For instance, there is no PARTITION BY clause in the "to report" column, so we are taking into consideration all of the rows of the report, which add up to 520.

Consider that the GROUP BY clause produces four groupings:

  • (customer, city) - equivalent to GROUP BY customer, city
  • (customer) - equivalent to GROUP BY customer
  • (city) - equivalent to GROUP BY city
  • () - equivalent to using an aggregate without GROUP BY clause

Each of those groupings is a different way to slice the data and for each the units total is 130. In the case of the "to customer" and "to city" columns we have a denominator that is twice as large as it should be. Take for example the cities and notice that the rows where "city" is not NULL contain the units for each city twice: once where "customer" is not NULL and another time where "customer" is NULL. These two categories correspond to the first and third groupings above, respectively. The same can be said for the customers but the rows would correspond to the first and second groupings. Clearly when calculating the denominator of each ratio we need to take into consideration that different rows belong to different slices of the data.

The Key

The key to getting the desired results is to use the GROUPING aggregate function. This function "returns a bit mask indicating which GROUP BY expressions are not included in the current grouping set". In other words it can give a different result for each grouping produced by the GROUP BY clause. We can use this function to help calculate the denominator for each of the ratio columns. To get the desired effect we use the GROUPING function inside the PARTITION BY clause of our window functions like so:

  SELECT customer
       , city
       , SUM(units) AS "units"
       , ROUND( 100 * SUM(units) / SUM(SUM(units)) OVER (PARTITION BY GROUPING(customer, city)) , 2) AS "to report"
       , ROUND( 100 * SUM(units) / SUM(SUM(units)) OVER (PARTITION BY GROUPING(customer, city), customer) , 2) AS "to customer"
       , ROUND( 100 * SUM(units) / SUM(SUM(units)) OVER (PARTITION BY GROUPING(customer, city), city) , 2) AS "to city"
    FROM transactions
GROUP BY CUBE(customer, city)

Having the GROUPING function inside the PARTITION BY clause ensures that the denominator for each ratio corresponds only to the rows of a particular grouping. Fortunately we do not have to give much thought to the arguments that we will pass to the GROUPING function. You can simply include all the columns that appear in the GROUP BY, although only the ones that do not appear in all of the grouping sets are necessary.

Desired Result

Now that we are calculating the denominator taking the groupings into consideration we get the correct results.

customer city units to report to customer to city
Bob Manchester 30 23.08 75.00 46.15
Bob Concord 10 7.69 25.00 25.00
Chuck Concord 10 7.69 33.33 25.00
Chuck Manchester 20 15.38 66.67 30.77
Jane Derry 10 7.69 100.00 40.00
Tim Concord 20 15.38 40.00 50.00
Tim Derry 15 11.54 30.00 60.00
Tim Manchester 15 11.54 30.00 23.08
Bob null 40 30.77 100.00 30.77
Chuck null 30 23.08 100.00 23.08
Jane null 10 7.69 100.00 7.69
Tim null 50 38.46 100.00 38.46
null Concord 40 30.77 30.77 100.00
null Derry 25 19.23 19.23 100.00
null Manchester 65 50.00 50.00 100.00
null null 130 100.00 100.00 100.00

Upvotes: 3

Related Questions