Reputation: 1417
The smaller subset of data (99% reduction) will then be used by my reporting application.
As this will be 'built' on a daily basis, what is the best tool: a Stored Procedure, a View, or something else?
Upvotes: 1
Views: 1156
Reputation: 142296
Build and maintain a Summary table. Initially you would need to run a big GROUP BY to collect all the old data. After that, a nightly job would compute COUNT(*), SUM(...), etc., for the previous day.
Then the 'report' would run much faster against this new table.
The key for that table would include the day (not date+time), plus a few columns that you may need for the report(s).
I find that the typical speedup is 10x; you might get 100x (99% reduction).
The best tool is a script that you run via cron (or perhaps a MySQL EVENT). It would simply do something like:
    INSERT INTO SummaryTable (dy, ..., ct, tot, ...)
        SELECT DATE(datetime), ...,    -- key
               COUNT(*), SUM(...), ... -- data
            FROM FactTable
            WHERE datetime >= CURDATE() - INTERVAL 1 DAY
              AND datetime <  CURDATE()
            GROUP BY DATE(datetime), ...;
That one SQL statement may be all that is needed. Yes, it could be in a Stored Procedure, but that is not much different than having it directly in the nightly script.
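For concreteness, here is a minimal sketch of how that statement might be fleshed out, assuming a hypothetical fact table orders (with a DATETIME column created_at, a region dimension, and an amount to total) and a hypothetical summary table daily_orders_summary:

    -- Hypothetical summary table; (dy, region) is the key described above.
    CREATE TABLE daily_orders_summary (
        dy     DATE NOT NULL,           -- the day being summarized
        region VARCHAR(20) NOT NULL,    -- a dimension the reports need
        ct     INT UNSIGNED NOT NULL,   -- COUNT(*) for that day/region
        tot    DECIMAL(14,2) NOT NULL,  -- SUM(amount) for that day/region
        PRIMARY KEY (dy, region)
    );

    -- Nightly job: summarize yesterday.
    INSERT INTO daily_orders_summary (dy, region, ct, tot)
        SELECT DATE(created_at), region,
               COUNT(*), SUM(amount)
            FROM orders
            WHERE created_at >= CURDATE() - INTERVAL 1 DAY
              AND created_at <  CURDATE()
            GROUP BY DATE(created_at), region;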
In some cases it may be better to use INSERT ... ON DUPLICATE KEY UPDATE ... SELECT ... (but that gets messy).
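A hedged sketch of that variant, using the same hypothetical tables; here the UPDATE branch overwrites rather than adds, so re-summarizing the same day is harmless:

    INSERT INTO daily_orders_summary (dy, region, ct, tot)
        SELECT DATE(created_at), region, COUNT(*), SUM(amount)
            FROM orders
            WHERE created_at >= CURDATE() - INTERVAL 1 DAY
              AND created_at <  CURDATE()
            GROUP BY DATE(created_at), region
        ON DUPLICATE KEY UPDATE
            ct  = VALUES(ct),   -- overwrite, so reruns are idempotent
            tot = VALUES(tot);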
When talking about "averages", consider the following: the average over several days is not AVG(daily_avgs), but SUM(daily_sums) / SUM(daily_counts). That is, the summary table probably needs COUNT(*) and SUM(...), not AVG(...).
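Using the hypothetical summary table from above, a monthly report would compute the true average like this (the date range is just an example):

    SELECT region,
           SUM(tot) / SUM(ct) AS avg_amount  -- weight each day by its row count
        FROM daily_orders_summary
        WHERE dy >= '2023-01-01'
          AND dy <  '2023-02-01'
        GROUP BY region;

AVG(tot / ct) would instead weight every day equally, which is wrong whenever the daily counts differ.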
To initially populate this summary table, I would write a one-time script to slowly walk through the 600M rows, one day at a time. Sure, you could do it all at once, but the interference with everything else might be 'bad'.
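One way to write that one-time script, sketched as a stored procedure that summarizes a single day per loop iteration (names are the hypothetical ones from above):

    DELIMITER //
    CREATE PROCEDURE backfill_summary(IN start_day DATE, IN stop_day DATE)
    BEGIN
        DECLARE d DATE DEFAULT start_day;
        WHILE d < stop_day DO
            INSERT INTO daily_orders_summary (dy, region, ct, tot)
                SELECT DATE(created_at), region, COUNT(*), SUM(amount)
                    FROM orders
                    WHERE created_at >= d
                      AND created_at <  d + INTERVAL 1 DAY
                    GROUP BY DATE(created_at), region;
            DO SLEEP(1);                 -- throttle to reduce interference
            SET d = d + INTERVAL 1 DAY;
        END WHILE;
    END //
    DELIMITER ;

    CALL backfill_summary('2020-01-01', CURDATE());  -- earliest day is hypothetical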
Even better would be for the nightly script to include code to "pick up where it left off". This way, if the script fails to run some night it will repair the omission the next night. Or you can manually run it when you see a problem. And an extra run won't hurt anything.
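A sketch of that "pick up where it left off" logic, again with the hypothetical names: derive the starting day from the summary table itself, then summarize every complete day that is still missing:

    -- Fall back to a hypothetical earliest day if the table is empty.
    SET @next_day = (SELECT COALESCE(MAX(dy) + INTERVAL 1 DAY, '2020-01-01')
                         FROM daily_orders_summary);

    INSERT INTO daily_orders_summary (dy, region, ct, tot)
        SELECT DATE(created_at), region, COUNT(*), SUM(amount)
            FROM orders
            WHERE created_at >= @next_day
              AND created_at <  CURDATE()  -- stop before today's partial day
            GROUP BY DATE(created_at), region;

If the script already ran tonight, the date range is empty and a second run inserts nothing, which is why an extra run won't hurt.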
While you are at it, think about other Summary Tables you might need. I typically find that 3-7 summary tables are needed for a Data Warehouse application. On the other hand, keep in mind that weekly and monthly summaries can be derived (efficiently enough) from a daily summary table. In a couple of cases, I had an Hourly summary table for one thing, then daily tables for different things.
600M rows is big. Will 'old' data be purged? Once you have the summary tables you need, will the 'old' data no longer be needed? Blog on using Partitioning for such.
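A sketch of what that might look like, assuming the fact table's unique keys allow the date column into the partitioning key; dropping a partition is then a nearly instant way to purge a month:

    ALTER TABLE orders
        PARTITION BY RANGE (TO_DAYS(created_at)) (
            PARTITION p202301 VALUES LESS THAN (TO_DAYS('2023-02-01')),
            PARTITION p202302 VALUES LESS THAN (TO_DAYS('2023-03-01')),
            PARTITION pmax    VALUES LESS THAN MAXVALUE
        );

    ALTER TABLE orders DROP PARTITION p202301;  -- purge January 2023 cheaply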
Upvotes: 2