Reputation: 527

postgresql - Top 5 values in magnitude for multiple columns, alongside sum of values

I'm looking for advice on the best (most efficient) way to compute the top 5 values in magnitude for multiple columns, and I also need to compute the sum for another column.

Say I have data with the headings: (person, daydate, month, dailyqty, dailymax1, dailymax2), where for each day for each person i have the total quantity, the max quantity (measure 1) and the max quantity (measure 2).

What I want to do is compute for each person, for each month (1) the sum of dailyqty, (2) the top 5 values of dailymax1 (3) the top 5 values for dailymax2. It may be the case that there aren't even 5 daily values in a month, in which case I wish to return null.

I can't think of how to do this without a join, since I'm an sql newbie. I'm aware that the sum of dailyqty will be repeated for each of the top 5 values in the group - this is okay.

Some dummy data:

CREATE TABLE test (
    person varchar(50),
    daydate date,
    month integer,
    dailyqty double precision,
    dailymax1 double precision,
    dailymax2 double precision
);

INSERT INTO test(person, daydate, month, dailyqty, dailymax1, dailymax2)
VALUES
    ('A', '2015-01-01', 1, 5, 0.5, 4),
    ('A', '2015-01-02', 1, 8, 3, 4),
    ('A', '2015-01-03', 1, 7, 1, 3),
    ('A', '2015-01-04', 1, 1, 2, 2),
    ('A', '2015-01-05', 1, 9, 6, 8),
    ('A', '2015-01-06', 1, 7, 2.5, 7),
    ('A', '2015-01-07', 1, 2, 4, 7),
    ('A', '2015-01-08', 1, 5, 1, 3),
    ('B', '2015-01-01', 1, 20, 8, 1),
    ('B', '2015-01-02', 1, 22, 9, 2)

Desired result

Thanks! A

Upvotes: 1

Answers (3)

Jorge Y.

Reputation: 1133

This query replicates the desired results posted in the question:

SELECT xt1.person, xt1.month, xt1.monthlyqty, xt3.max1, xt4.max2
FROM (
  SELECT SUM(COALESCE(t.dailyqty, 0)) as monthlyqty, t.person, t.month
  FROM test t
  GROUP by t.person, t.month
) xt1
CROSS JOIN (
  VALUES (1), (2), (3), (4), (5)
) xt2
LEFT OUTER JOIN (
  SELECT t.person, t.month, t.dailymax1 as max1
  , ROW_NUMBER() OVER (PARTITION BY t.person, t.month ORDER BY t.dailymax1 DESC NULLS LAST) as colnumber
  FROM test t
) xt3 ON xt2.column1 = xt3.colnumber AND xt1.person = xt3.person AND xt1.month = xt3.month
LEFT OUTER JOIN (
  SELECT t.person, t.month, t.dailymax2 as max2
  , ROW_NUMBER() OVER (PARTITION BY t.person, t.month ORDER BY t.dailymax2 DESC NULLS LAST) as colnumber
  FROM test t
) xt4 ON xt2.column1 = xt4.colnumber AND xt1.person = xt4.person AND xt1.month = xt4.month;

Some things to consider that could change the query... First you could think about whether the columns dailyqty, dailymax1 and dailymax2 are really nullable (like in your table definition). If they were not, you could simplify the COALESCE(t.dailyqty, 0) to just t.dailyqty and the two DESC NULLS LAST to just DESC.

Second, you may consider replacing the CROSS JOIN to xt2 with a JOIN to a call to generate_series like: CROSS JOIN generate_series (1, 5) xt2 and then replacing xt2.column1 appearances with just xt2. I'm not sure which approach would be more efficient, maybe both do a similar thing, but it is worth to check with your real data if there are significant differences.

Last, you say you want to compute for each person and month, but "month" could refer to the "month" column or the month in the "daydate" column. I've picked the first option as it is easier to write :), but modifying a couple of things the query could be adapted for the other column.

Upvotes: 2

CodeFuller

Reputation: 31312

This query gives the output you need:

WITH FilledData AS
(
    WITH Filler AS
    (
        WITH t1 AS
        (
            SELECT DISTINCT person, month FROM test
        ),
        t2 AS
        (
            SELECT generate_series as order FROM generate_series(1, 4)
        )
        SELECT t1.person, t1.month, CAST(NULL AS double precision) AS dailyqty, CAST(NULL AS double precision) AS dailymax1, CAST(NULL AS double precision) AS dailymax2 FROM t1 CROSS JOIN t2
    )
    SELECT person, month, dailyqty, dailymax1, dailymax2 FROM test UNION ALL
    SELECT person, month, dailyqty, dailymax1, dailymax2 FROM Filler ORDER BY person, month
),
monthlyqty AS
(
    SELECT person, month, SUM(dailyqty) AS monthlyqty FROM test GROUP BY person, month
),
dailymax1_table AS
(
    SELECT person, month, dailymax1, dailymax1_order
    FROM (
       SELECT *, row_number() over (partition by person, month order by dailymax1 desc NULLS LAST) as dailymax1_order
       FROM FilledData
    ) t1 WHERE dailymax1_order <= 5
),
dailymax2_table AS
(
    SELECT person, month, dailymax2, dailymax2_order
    FROM (
       SELECT *, row_number() over (partition by person, month order by dailymax2 desc NULLS LAST) as dailymax2_order
       FROM FilledData
    ) t2 WHERE dailymax2_order <= 5
)

SELECT dailymax1_table.person, dailymax1_table.month, monthlyqty.monthlyqty, dailymax1_table.dailymax1 as max1, dailymax2_table.dailymax2 as max2
FROM dailymax1_table JOIN monthlyqty
ON monthlyqty.person = dailymax1_table.person AND
   monthlyqty.month = dailymax1_table.month
JOIN dailymax2_table ON
   dailymax1_table.person = dailymax2_table.person AND
   dailymax1_table.month = dailymax2_table.month AND
   dailymax1_table.dailymax1_order = dailymax2_table.dailymax2_order;

Upvotes: 1

Gordon Linoff

Reputation: 1270873

You can use window functions to put this together:

select . . .
from (select t.*,
             sum(dailyqty) over (partition by person, date_trunc('month', datecol)) as monthqty,
             row_number() over (partition by person, date_trunc('month', datecol) order by dailyqty desc) as seqnum
      from t
     ) t
where seqnum <= 5;

You can extract the columns that you want from the subquery.

Upvotes: 0

postgresql - Top 5 values in magnitude for multiple columns, alongside sum of values

Answers (3)

Related Questions