Reputation: 163234
I have a table (logs) that has the following columns (there are others, but these are the important ones): Timestamp and Duration.
Basically each row is a record of an event that starts at one time and ends at another. This table currently has a few hundred thousand rows in it, and I expect it to grow into the millions. To speed up queries, I have added another column, EndTime, with precomputed values: to calculate EndTime, I add the number of seconds in Duration to the Timestamp field.
Now what I want to do is run a query that counts the number of rows whose start (Timestamp) and end (EndTime) times straddle a certain point in time, i.e. whose interval contains that point. I then want to run this query for every second over a large timespan (such as a year). I would also like to count the number of rows that start at a particular point in time, and the number that end at a particular point in time.
I have created the following query:
SELECT
    `dates`.`date`,
    COUNT(*) AS `total`,
    SUM(IF(`dates`.`date` = `logs`.`Timestamp`, 1, 0)) AS `new`,
    SUM(IF(`dates`.`date` = `logs`.`EndTime`, 1, 0)) AS `dropped`
FROM
    `logs`,
    (SELECT DATE_ADD("2010-04-13 09:45:00", INTERVAL `number` SECOND) AS `date`
     FROM `numbers` LIMIT 120) AS `dates`
WHERE `dates`.`date` BETWEEN `logs`.`Timestamp` AND `logs`.`EndTime`
GROUP BY `dates`.`date`;
Note that the numbers table exists strictly to make enumerating a date range easy. It is a table with one column, number, containing the values 1, 2, 3, 4, 5, etc.
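For anyone building a similar helper table, one common way to populate it is seeding a few rows and repeatedly doubling with INSERT ... SELECT. This is just a sketch (the table name matches the question; the doubling approach is one option among many):

```sql
CREATE TABLE `numbers` (
    `number` INT UNSIGNED NOT NULL PRIMARY KEY
);

-- Seed with 1..8, then double the row count with each INSERT ... SELECT
INSERT INTO `numbers` (`number`) VALUES (1),(2),(3),(4),(5),(6),(7),(8);
INSERT INTO `numbers` SELECT `number` + 8  FROM `numbers`;  -- now 1..16
INSERT INTO `numbers` SELECT `number` + 16 FROM `numbers`;  -- now 1..32
-- ...repeat, doubling the range each time, until the table is large enough
```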
This gives me exactly what I am looking for... a table with 4 columns: date, total, new, and dropped.
The trouble is, this query can take a significant amount of time to execute. To go through 120 seconds (as shown in the query), it takes about 10 seconds. I suspect that this is about as fast as I am going to get it, but I thought I would ask here if anyone had any ideas for improving the performance of this query.
Any suggestions would be most helpful. Thank you for your time.
Edit: I have indexes on Timestamp and EndTime.
The output of EXPLAIN on my query:
"id";"select_type";"table";"type";"possible_keys";"key";"key_len";"ref";"rows";"Extra"
"1";"PRIMARY";"<derived2>";"ALL";NULL;NULL;NULL;NULL;"120";"Using temporary; Using filesort"
"1";"PRIMARY";"logs";"ALL";"Timestamp,EndTime";NULL;NULL;NULL;"296159";"Range checked for each record (index map: 0x6)"
"2";"DERIVED";"numbers";"index";NULL;"PRIMARY";"4";NULL;"35546940";"Using index"
When I run ANALYZE TABLE on my logs table, it reports status OK.
Upvotes: 5
Views: 16724
Reputation: 77400
Note in the EXPLAIN output that the join type for the logs table is "ALL" and the key is NULL, which means a full table scan is scheduled. The "Range checked for each record" message means that MySQL uses the range access method on logs after examining column values from somewhere else in the result. I take this to mean that once dates has been created, MySQL can perform a ranged join on logs using the second and third indices (likely those on Timestamp and EndTime) rather than performing a full table scan. If you only have indices on Timestamp and EndTime separately, try adding an index on both, which might result in a more efficient join type (e.g. index_merge rather than range):
CREATE INDEX `start_end` ON `logs` (`Timestamp`, `EndTime`);
I believe (though I could easily be wrong) that the other items in the query plan either aren't really a concern or can't be eliminated. The filesort, as an example of the latter, is likely due to the GROUP BY. In other words, this is likely the extent of what you can do with this particular query, though radically different queries or approaches that address the table storage format may still be more efficient.
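As an illustration of what a radically different query could look like (a hypothetical sketch, not something tested against this schema): instead of joining every second against every interval, you could count starts and ends per second and derive the number of active rows as a running total with user variables:

```sql
SET @active := 0;

SELECT `second`, (@active := @active + `delta`) AS `active`
FROM (
    -- +1 for each interval that starts at a given second
    SELECT `Timestamp` AS `second`, COUNT(*) AS `delta`
    FROM `logs` GROUP BY `Timestamp`
    UNION ALL
    -- -1 one second after each interval ends (BETWEEN is inclusive of EndTime)
    SELECT `EndTime` + INTERVAL 1 SECOND, -COUNT(*)
    FROM `logs` GROUP BY `EndTime`
) AS `events`
ORDER BY `second`;
```

Note this only emits seconds at which something changes; seconds with no starts or ends would need to be filled in (e.g. from the numbers table), and relying on user-variable assignment order alongside ORDER BY can be fragile in some MySQL versions.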
Upvotes: 5
Reputation: 1998
You could look at MERGE tables to speed up processing. With MERGE tables, since the data is split across several underlying tables, each index is smaller, resulting in faster lookups. Also, if you have multiple processors, the searches can happen in parallel, increasing performance.
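For illustration, a sketch of what that might look like. The column definitions are assumed from the question (only Timestamp, Duration, and EndTime were mentioned), and note that all underlying tables must use the MyISAM engine with identical structure:

```sql
-- Hypothetical monthly partitions of the logs table
-- (all must be MyISAM with identical columns and indexes)
CREATE TABLE `logs_2010_03` LIKE `logs`;
CREATE TABLE `logs_2010_04` LIKE `logs`;

-- The MERGE table presents the underlying tables as one logical table
CREATE TABLE `logs_all` (
    `Timestamp` DATETIME NOT NULL,
    `Duration`  INT NOT NULL,
    `EndTime`   DATETIME NOT NULL,
    KEY (`Timestamp`),
    KEY (`EndTime`)
) ENGINE=MERGE UNION=(`logs_2010_03`, `logs_2010_04`) INSERT_METHOD=LAST;
```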
Upvotes: 2