FatalMojo
FatalMojo

Reputation: 435

Joins and count accross several tables

I'm trying to write a complex (at least, for my level of knowledge) string but I'm having one hell of a time.

Here's the problem. I have two tables, one named t1 and one named c1.

The tables are defined as follow:

table T1:

e_id, char(8),  
e_date, datetime,  
e_status, varchar(2)

table C1:

e_id, char(8),  
e_date, datetime,  
e_status, varchar(2)

Each table contains a list of identifiers that may or may not be found in both tables (they may or may not be unique within each table), and associated statuses (can be 'OK' or 'R' in the T1 table, can be 'OK' or 'C' in the C1 table), and a datetime, e_date, associated with each occurence of e_id's

I'm trying to write a query that will:

I'll do my best to write some sample data/results here. For clarity, I will disregard the tables datatypes. Assume the current date and time are 2012-Nov-08 19:00:00

T1:

  1. e_id: 'A', e_date: 2012-Nov-08 10:00:00, e_status: 'OK'
  2. e_id: 'A', e_date: 2012-Nov-08 10:00:00, e_status: 'R'
  3. e_id: 'A', e_date: 2012-Oct-15 10:00:00, e_status: 'R'
  4. e_id: 'B', e_date: 2012-Oct-15 10:00:00, e_status: 'OK'
  5. e_id: 'A', e_date: 2012-Oct-15 10:00:00, e_status: 'OK'
  6. e_id: 'A', e_date: 2012-Oct-15 10:00:00, e_status: 'R'
  7. e_id: 'A', e_date: 2012-Oct-15 10:00:00, e_status: 'R'
  8. e_id: 'A', e_date: 2010-Jan-01 10:00:00, e_status: 'R'
  9. e_id: 'A', e_date: 2010-Jan-01 10:00:00, e_status: 'R'

C1:

  1. e_id: 'A', e_date: 2012-Oct-01 10:00:00, e_status: 'C
  2. e_id: 'B', e_date: 2012-Oct-01 10:00:00, e_status: 'OK'
  3. e_id: 'A', e_date: 2012-Oct-01 10:00:00, e_status: 'C
  4. e_id: 'B', e_date: 2012-Oct-01 10:00:00, e_status: 'OK'
  5. e_id: 'A', e_date: 2012-Oct-01 10:00:00, e_status: 'OK'

Running the query would yield:

e_id, e_date, e_status, r_count, c_count
1. e_id: 'A', e_date: 2012-Nov-08 10:00:00, e_status: 'OK', r_count: 6, c_count: 2
2. e_id: 'A', e_date: 2012-Nov-08 10:00:00, e_status: 'R', r_count: 6, c_count: 2
3. e_id: 'A', e_date: 2012-Oct-15 10:00:00, e_status: 'R', r_count: 6, c_count: 2
4. e_id: 'A', e_date: 2012-Oct-15 10:00:00, e_status: 'OK', r_count: 6, c_count: 2
5. e_id: 'A', e_date: 2012-Oct-15 10:00:00, e_status: 'R', r_count: 6, c_count: 2
6. e_id: 'A', e_date: 2012-Oct-15 10:00:00, e_status: 'R', r_count: 6, c_count: 2

I am really sorry, I have had to change the date on T1 rows 3 to 7 (rows 3 4 5 6 of the results) as the values were erroneous.

T1's Row 4 was not returned because no e_id: B was found in the last 24 hours
T1 Rows 8 and 9 were not returned because they were outside of the last 30 days

Upvotes: 1

Views: 139

Answers (1)

Jonathan Leffler
Jonathan Leffler

Reputation: 753775

Time to do some TDQD — Test-Driven Query Design.

Rows in T1 from the last 24 hours

SELECT DISTINCT e_id
  FROM T1
 WHERE e_date >= DATE_SUB(NOW(), INTERVAL 24 HOUR)

This will be a prevalent sub-query in the other parts of the query.

Rows in T1 from the last 30 days...

...where there was an entry in T1 within the last 24 hours.

SELECT a.e_id
  FROM t1 AS a
  JOIN (SELECT DISTINCT e_id
          FROM T1
         WHERE e_date >= DATE_SUB(NOW(), INTERVAL 24 HOUR)
       ) AS b ON b.e_id = a.e_id
 WHERE a.e_date >= DATE_SUB(NOW(), INTERVAL 30 DAY)

We can add other columns as we need them.

Count of rows in T1 with status 'R' ...

...where there was an entry in T1 within the last 24 hours

SELECT a.e_id, COUNT(*) AS r_count  -- Per question; why not t_count?
  FROM t1 AS a
  JOIN (SELECT DISTINCT e_id
          FROM T1
         WHERE e_date >= DATE_SUB(NOW(), INTERVAL 24 HOUR)
       ) AS b ON b.e_id = a.e_id
 WHERE a.e_status = 'R'
 GROUP BY a.e_id

Count of rows in C1 with status 'C' ...

...where there was an entry in T1 within the last 24 hours

SELECT a.e_id, COUNT(*) AS c_count
  FROM c1 AS a
  JOIN (SELECT DISTINCT e_id
          FROM T1
         WHERE e_date >= DATE_SUB(NOW(), INTERVAL 24 HOUR)
       ) AS b ON b.e_id = a.e_id
 WHERE a.e_status = 'C'
 GROUP BY a.e_id

Assemble the set of queries to produce the result

SELECT a.e_id, a.e_date, a.e_status, c.r_count, d.c_count
  FROM t1 AS a
  JOIN (SELECT DISTINCT e_id
          FROM T1
         WHERE e_date >= DATE_SUB(NOW(), INTERVAL 24 HOUR)
       ) AS b ON b.e_id = a.e_id
  LEFT JOIN -- Because there might be no OK rows in T1
       (SELECT a.e_id, COUNT(*) AS r_count
          FROM t1 AS a
          JOIN (SELECT DISTINCT e_id
                  FROM T1
                 WHERE e_date >= DATE_SUB(NOW(), INTERVAL 24 HOUR)
               ) AS b ON b.e_id = a.e_id
         WHERE a.e_status = 'OK'
         GROUP BY a.e_id
       ) AS c ON c.e_id = a.e_id
  LEFT JOIN -- Because there might be no OK rows in C1
       (SELECT a.e_id, COUNT(*) AS c_count
          FROM c1 AS a
          JOIN (SELECT DISTINCT e_id
                  FROM T1
                 WHERE e_date >= DATE_SUB(NOW(), INTERVAL 24 HOUR)
               ) AS b ON b.e_id = a.e_id
         WHERE a.e_status = 'OK'
         GROUP BY a.e_id
       ) AS d ON d.e_id = a.e_id
 WHERE a.e_date >= DATE_SUB(NOW(), INTERVAL 30 DAY)

You probably could write the sub-queries without the 24 hour sub-sub-query, but it is likely to be effective to eliminate as many rows as soon as possible.


One advantage of the concept behind TDQD is that you can check interim results. There were some basically trivial syntax issues (in part because MySQL is not my primary DBMS), but the change from JOIN to LEFT JOIN for the two COUNT sub-queries is the sort of thing you're apt to spot as you assemble the query. Trying to get everything right first time is — hard, if not futile. But the step-by-step build-up can give you confidence in what you've done. I'd never build a query as complex as this from scratch without testing the component sub-queries.

Thanks for the (minor) updates, FatalMojo.

Upvotes: 2

Related Questions