Luca Folin
Luca Folin

Reputation: 101

find data missing on monthly base

With this data table "table":

enter image description here

and a table "ID" of well known newID value (1,2,3)

enter image description here

I would like to find all the months in which some newID value are missing. The result should only include months that that have at least 1 newId.

This is the expected result for the data listed above:

enter image description here

How can I achieve this task ?

Table Schema: (fiddle) http://sqlfiddle.com/#!9/72af42

CREATE TABLE `ID`(
    Id int unsigned not null primary key
);
INSERT INTO `ID` (Id) VALUES (1),(2),(3);

CREATE TABLE `table`(
    recId int unsigned not null auto_increment primary key,
    Data DateTime not null,
    newID int not null
);

INSERT INTO `table` (`Data`,`newID`) VALUES
('2017-12-06',1),
('2017-12-06',3),
('2017-11-16',1),
('2017-11-16',2),
('2017-11-16',3),
('2017-10-05',2),
('2017-10-05',3),
('2017-10-03',2),
('2017-10-03',3),
('2017-08-16',1),
('2017-08-16',2),
('2017-08-16',3),
('2017-05-05',1),
('2017-05-05',2),
('2017-05-05',3);

Upvotes: 1

Views: 104

Answers (1)

Chris Schaller
Chris Schaller

Reputation: 16554

Because you only need the missing ID records that have a correlated Month, this is a simple Non-Correlated sub-query scenario.

In SQL a Non-Correlated sub-query can be modeled either using a NOT EXISTS expression in the WHERE clause, or you can use a LEFT OUTER JOIN and only include the NULL results for the columns in the joined table, as this would indicate the records where no match was found

This query in only complicated by the fact that you want to evaluate the MONTH component of the date, not the explicit date value. SQL gives use all the necessary tools for this, we can even format the output that you desire:

SELECT DATE_FORMAT(CAST(CONCAT(m.year,'-',m.month,'-01') as DateTime), '%b-%y') as Data, i.Id as newId
FROM (
  SELECT YEAR(Data) AS Year, MONTH(Data) AS month
  FROM `table`
  GROUP BY YEAR(Data), MONTH(Data)
) m
CROSS JOIN `ID` i
LEFT OUTER JOIN `table` t ON YEAR(t.Data) = m.year AND MONTH(t.Data) = m.month AND t.newId = i.Id
WHERE t.Data IS NULL
ORDER BY m.Year DESC, m.month DESC

Have a look at this fiddle: http://sqlfiddle.com/#!9/72af42/2


Choosing between WHERE NOT EXISTS AND LEFT OUTER JOIN can affect performance slightly, but the affect will depend on your query, your RDBMS and the available indexes. I personally use the JOIN syntax first because IMO it is simpler to maintain, but you use your own discretion.

There is a lot of talk at least in MS SQL that NOT EXISTS should be faster than JOIN but if performance is an issue you for this specific query you should look at storing the year and month columns as persisted values so that they can be indexed and to reduce the function evaluatations.because it will evaluate less lookups.

For a comparison, this is the equivalent WHERE NOT EXISTS query: http://sqlfiddle.com/#!9/72af42/5

SELECT DATE_FORMAT(CAST(CONCAT(m.year,'-',m.month,'-01') as DateTime), '%b-%y') as Data, i.Id as newId
FROM (
  SELECT YEAR(Data) AS Year, MONTH(Data) AS month
  FROM `table`
  GROUP BY YEAR(Data), MONTH(Data)
) m
CROSS JOIN `ID` i
WHERE NOT EXISTS (
  SELECT * 
  FROM `table` t 
  WHERE YEAR(t.Data) = m.year 
    AND MONTH(t.Data) = m.month 
    AND t.newId = i.Id
)
ORDER BY m.Year DESC, m.month DESC

How to Optimise using Persisted Values?

If we pre-evaluate the YEAR() and MONTH() and store the results in the table directly, then the query speed will improve but we can also add indexes to super-charge it.

Consider the over all PROs and CONs before going this far...

  • Do you really need this level of optimisation?
  • How often is the query going to be executed?
  • Can you change the application logic to use a more appropriate WHERE clause to restrict the scope of the data instead?

Materialized Views

One solution to this is to create and manage a materialised view. This is a DW technique that effectively allows you to define a view but have it executed on a regular basis and stored into it's own table space.

A materialised View does not optimise your query, but it allows you to execute complex and long-running query once, so that the results can be queried directly like a normal table, without having to re-evaluate column expressions.

Your data and the type of query looks like a good candidate for materialized view because it is querying historical data that has a zero or very low rate of change, only new rows are updated, and we probably do not care for the current month results. In this case if you end up running the query many times, and the result is staying more or less constant, then why not run the query as a process, say every month and store the results in a purpose built table, then your application can query that table frequently with lightning fast results.

MySQL does not support Materialized Views, but you can replicate the concept as explained above in your application logic, some other RDBMS provide this OOTB, its the concept that should be considered.

Computed Columns

You could just add the extra columns to your table and maintain these columns from the user/application logic, but that is not very reliable, unless you trust your application developers and the app is the only process that will be updating this table.

Computed columns are perfect for reliability in this scenario, but they will only help us with performance if you can persist the value to the column store. (the default state of a computed column is that the expression will be evaluated at execution time, which offers little benefit over the current query)

Again this is where MySQL will let you down, many other RDBMS offer simpler ways to do this, you need MySQL v5.7 for this to work

ALTER TABLE `table` ADD `year` GENERATED ALWAYS AS (YEAR(Data)) STORED;
ALTER TABLE `table` ADD `month` GENERATED ALWAYS AS (MONTH(Data)) STORED;

Trigger

Your other option is to add the columns, and then use triggers to maintain the values, MySQL doesn't make this easy, but it can work

  1. Add the columns to your table:

    ALTER TABLE `table` ADD (`year` int NULL);
    ALTER TABLE `table` ADD (`month` int NULL);
    
  2. Create Triggers to manage the values in these columns so the user cannot override them:

    DELIMITER $$
    
    CREATE TRIGGER persist_index_values_insert
      BEFORE INSERT ON `table` FOR EACH ROW
    BEGIN
      SET new.year= YEAR(new.Data);
      SET new.month = MONTH(NEW.Data);
    END$$
    
    CREATE TRIGGER persist_index_values_update
    BEFORE UPDATE ON `table` FOR EACH ROW
    BEGIN
      SET NEW.year = YEAR(NEW.Data);
      SET NEW.month = MONTH(NEW.Data);
    END$$
    

    DELIMITER ;

  3. Simpler Query:

     SELECT DATE_FORMAT(CAST(CONCAT(m.year,'-',m.month,'-01') as DateTime), '%b-%y') as Data, i.Id as newId
     FROM (
       SELECT `year`, `month`
       FROM `table`
       GROUP BY `year`, `month`
     ) m
     CROSS JOIN `ID` i
     LEFT OUTER JOIN `table` t ON t.year = m.year AND t.month = m.month AND t.newId = i.Id
     WHERE t.Data IS NULL
     ORDER BY m.Year DESC, m.month DESC
    
  4. Indexes can now be applied as needed, you should consult your query execution plans for guidance, but I would suggest you need one index for year and month and newId as a minimum:

      CREATE INDEX IX_TABLE_YEAR_MONTH_NEWID ON `table` (`year`,`month`,'newId');
    

Upvotes: 0

Related Questions