Sandeep
Sandeep

Reputation: 1401

MySQL: Sophisticated GROUP BY query

We are trying to find a solution to a problem that I have been able to replicate with an illustrative table. Here's how the illustrative table is created:

create table illustrativeTable
(
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  label VARCHAR(4),
  reportingDate DATE,
  attr_1 INT,
  attr_2 INT,
  attr_3 INT,

  PRIMARY KEY(id)
);

I have populated the illustrative table as follows:

INSERT INTO illustrativeTable(label, reportingDate, attr_1, attr_2, attr_3) VALUES('A', '2018-01-01', '1', '3', '100'),
('A', '2018-01-05', '2', '4', '125'), ('A', '2018-01-07', '2', '5', '125'),
('A', '2018-01-08', '3', '6', '150'), ('A', '2018-01-11', '4', '7', NULL),
('B', '2018-01-02', '1', '3', '50'), ('B', '2018-01-05', '2', '5', '75'),
('B', '2018-01-06', '3', '6', '200'), ('B', '2018-01-16', '3', '5', '200'),
('C', '2018-01-05', '6', '9', '175'),('C', '2018-01-08', '7', '9', '225'),
('D', '2018-01-01', '2', '5', '55'), ('D', '2018-01-15', '3', '6', 85),
('D', '2018-01-21', '4', '7', '75'), ('E', '2018-01-25', '2', '4', '65'),
('E', '2018-01-28', '2', '5', NULL);

The query

SELECT * FROM illustrativeTable;

yields the following output:

+----+-------+---------------+--------+--------+--------+
| id | label | reportingDate | attr_1 | attr_2 | attr_3 |
+----+-------+---------------+--------+--------+--------+
|  1 | A     | 2018-01-01    |      1 |      3 |    100 |
|  2 | A     | 2018-01-05    |      2 |      4 |    125 |
|  3 | A     | 2018-01-07    |      2 |      5 |    125 |
|  4 | A     | 2018-01-08    |      3 |      6 |    150 |
|  5 | A     | 2018-01-11    |      4 |      7 |   NULL |
|  6 | B     | 2018-01-02    |      1 |      3 |     50 |
|  7 | B     | 2018-01-05    |      2 |      5 |     75 |
|  8 | B     | 2018-01-06    |      3 |      6 |    200 |
|  9 | B     | 2018-01-16    |      3 |      5 |    200 |
| 10 | C     | 2018-01-05    |      6 |      9 |    175 |
| 11 | C     | 2018-01-08    |      7 |      9 |    225 |
| 12 | D     | 2018-01-01    |      2 |      5 |     55 |
| 13 | D     | 2018-01-15    |      3 |      6 |     85 |
| 14 | D     | 2018-01-21    |      4 |      7 |     75 |
| 15 | E     | 2018-01-25    |      2 |      4 |     65 |
| 16 | E     | 2018-01-28    |      2 |      5 |   NULL |
+----+-------+---------------+--------+--------+--------+

Our issue is that we want to retrieve label, reportingDate and attr_3, from selected tuples, with the following constraints:
1) attr_2 - attr_1 = 3
2) attr_3 IS NOT NULL
3) In case of multiple hits, the value with the highest value for reportingDate is selected

The simplistic query:

SELECT label, reportingDate, attr_3 FROM illustrativeTable
WHERE label IN ('A', 'B', 'C', 'E') AND (attr_2-attr_1=3)
AND attr_3 IS NOT NULL GROUP BY label;

yields the following result:

+-------+---------------+--------+
| label | reportingDate | attr_3 |
+-------+---------------+--------+
| A     | 2018-01-07    |    125 |
| B     | 2018-01-05    |     75 |
| C     | 2018-01-05    |    175 |
+-------+---------------+--------+

The problem with this result is that for label 'A' the highest reportingDate that meets all the constraints is 2018-01-08. Similarly for label 'B' the highest reportingDate that meets all the constraints is 2018-01-06.

We would like to tweak the query so that the output looks as follows:

+-------+---------------+--------+
| label | reportingDate | attr_3 |
+-------+---------------+--------+
| A     | 2018-01-08    |    150 |
| B     | 2018-01-06    |    200 |
| C     | 2018-01-05    |    175 |
+-------+---------------+--------+

I did try some ideas from https://paulund.co.uk/get-last-record-in-each-mysql-group but I could not get the results I am looking for.

Upvotes: 1

Views: 64

Answers (1)

Nick
Nick

Reputation: 147146

To get the output for the highest reportingDate, you just need to add that as a constraint to the query. Note that unless you have multiple data values for a given reportingDate, you don't need a GROUP BY clause:

SELECT label, reportingDate, attr_3 
FROM illustrativeTable it1
WHERE label IN ('A', 'B', 'C', 'E') AND 
      (attr_2-attr_1=3) AND
      attr_3 IS NOT NULL AND
      reportingDate = (SELECT MAX(reportingDate) 
                       FROM illustrativeTable 
                       WHERE label = it1.label AND 
                             attr_2-attr_1=3 AND 
                             attr_3 IS NOT NULL)

Output:

label   reportingDate   attr_3  
A       2018-01-08      150
B       2018-01-06      200
C       2018-01-05      175

If you do have multiple values for a given reportingDate, you will need to GROUP BY label, and you will also have to decide whether you want the minimum or maximum value of attr_3, in which case you would change attr_3 in the query to MIN(attr_3) or MAX(attr_3) respectively.

Update

Based on the additional criteria specified by OP in a comment below, this is probably the most efficient query to get the desired result. It joins a sub-select of the initial table (with the non-aggregating conditions applied) to two other tables which give the maximum reportingDate by label and the maximum value of attr2 by reportingDate and label respectively, using the JOIN condition to then filter out all entries which do not match MAX(reportingDate) and MAX(attr_2).

SELECT it1.label, it1.reportingDate, it1.attr_3 
FROM (SELECT * 
      FROM illustrativeTable
      WHERE label IN ('A', 'B', 'C', 'E') AND 
            (attr_2-attr_1=3) AND
             attr_3 IS NOT NULL) it1
JOIN (SELECT label, MAX(reportingDate) AS max_reportingDate
      FROM illustrativeTable it1
      WHERE attr_2-attr_1=3 AND attr_3 IS NOT NULL
      GROUP BY label) it2
  ON it2.label = it1.label AND it2.max_reportingDate = it1.reportingDate
JOIN (SELECT label, reportingDate, MAX(attr_2) AS max_attr_2
      FROM illustrativeTable it1
      WHERE attr_2-attr_1=3 AND attr_3 IS NOT NULL
      GROUP BY label, reportingDate) it3
  ON it3.label = it1.label AND it3.reportingDate = it1.reportingDate AND it3.max_attr_2 = it1.attr_2
ORDER BY it1.label

For the sample data the output remains the same however I have tested it with data which triggers the MAX(attr_2) condition on rextester.

Upvotes: 2

Related Questions