Reputation: 59333
I have a MySQL table that is filled with mails from a postfix mail log. The table is updated very often, sometimes multiple times per second. Here's the SHOW CREATE TABLE output:
CREATE TABLE `postfix_mails` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`mail_id` varchar(20) COLLATE utf8_danish_ci NOT NULL,
`host` varchar(30) COLLATE utf8_danish_ci NOT NULL,
`queued_at` datetime NOT NULL COMMENT 'When the message was received by the MTA',
`attempt_at` datetime NOT NULL COMMENT 'When the MTA last attempted to relay the message',
`attempts` smallint(5) unsigned NOT NULL,
`from` varchar(254) COLLATE utf8_danish_ci DEFAULT NULL,
`to` varchar(254) COLLATE utf8_danish_ci NOT NULL,
`source_relay` varchar(100) COLLATE utf8_danish_ci DEFAULT NULL,
`target_relay` varchar(100) COLLATE utf8_danish_ci DEFAULT NULL,
`target_relay_status` enum('sent','deferred','bounced','expired') COLLATE utf8_danish_ci NOT NULL,
`target_relay_comment` varchar(4098) COLLATE utf8_danish_ci NOT NULL,
`dsn` varchar(10) COLLATE utf8_danish_ci NOT NULL,
`size` int(11) unsigned NOT NULL,
`delay` float unsigned NOT NULL,
`delays` varchar(50) COLLATE utf8_danish_ci NOT NULL,
`nrcpt` smallint(5) unsigned NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `mail_signature` (`host`,`mail_id`,`to`),
KEY `from` (`from`),
KEY `to` (`to`),
KEY `source_relay` (`source_relay`),
KEY `target_relay` (`target_relay`),
KEY `target_relay_status` (`target_relay_status`),
KEY `mail_id` (`mail_id`),
KEY `last_attempt_at` (`attempt_at`),
KEY `queued_at` (`queued_at`)
) ENGINE=InnoDB AUTO_INCREMENT=111592 DEFAULT CHARSET=utf8 COLLATE=utf8_danish_ci
I want to know how many mails were relayed through a specific host on a specific date, so I'm using this query:
SELECT COUNT(*) as `count`
FROM `postfix_mails`
WHERE `queued_at` LIKE '2016-04-11%'
AND `host` = 'mta03'
The query takes between 100 and 110 ms.
Currently the table contains about 70 000 mails, and the query returns around 31 000. This is only a couple of days' worth of mails, and I plan to keep at least a month. The query cache doesn't help much because the table is getting updated constantly.
I have tried doing this instead:
SELECT SQL_NO_CACHE COUNT(*) as `count`
FROM `postfix_mails`
WHERE `queued_at` >= '2016-04-11'
AND `queued_at` < '2016-04-12'
AND `host` = 'mta03'
But the query takes the exact same time to run. I have made these changes to the MySQL configuration:
[mysqld]
query_cache_size = 128M
key_buffer_size = 256M
read_buffer_size = 128M
sort_buffer_size = 128M
innodb_buffer_pool_size = 4096M
And confirmed that they are all in effect (SHOW VARIABLES), but the query doesn't run any faster.
Am I doing something stupid that makes this query take this long? Can you spot any obvious or non-obvious ways to make it faster? Is there another database engine that works better than InnoDB in this scenario?
mysql> EXPLAIN SELECT SQL_NO_CACHE COUNT(*) as `count`
-> FROM `postfix_mails`
-> WHERE `queued_at` >= '2016-04-11'
-> AND `queued_at` < '2016-04-12'
-> AND `host` = 'mta03';
+----+-------------+---------------+------+--------------------------+----------------+---------+-------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+------+--------------------------+----------------+---------+-------+-------+-------------+
| 1 | SIMPLE | postfix_mails | ref | mail_signature,queued_at | mail_signature | 92 | const | 53244 | Using where |
+----+-------------+---------------+------+--------------------------+----------------+---------+-------+-------+-------------+
1 row in set (0.00 sec)
Upvotes: 1
Views: 89
Reputation: 1269713
queued_at is a datetime value. Don't use LIKE. That converts it to a string, preventing the use of indexes and imposing a full-table scan. Instead, you want an appropriate index and to fix the query.
The query is:
SELECT COUNT(*) as `count`
FROM `postfix_mails`
WHERE `queued_at` >= '2016-04-11' AND `queued_at` < DATE_ADD('2016-04-11', interval 1 day) AND
`host` = 'mta03';
Then you want a composite index on postfix_mails(host, queued_at). The host column needs to be first.
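For example, the index could be created like this (the index name is just illustrative):
ALTER TABLE `postfix_mails`
ADD INDEX `host_queued_at` (`host`, `queued_at`);
With host first, the equality match on host narrows the index before the range scan on queued_at.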
Note: If your current version is counting 31,000 out of 70,000 emails, then an index will not be much help for that. However, this will make the code more scalable for the future.
Upvotes: 2
Reputation: 425341
If you need your query to be really fast, you'll need to materialize it.
MySQL lacks a way to do that natively, so you'll have to create a table like this:
CREATE TABLE mails_host_day
(
host VARCHAR(30) NOT NULL,
day DATE NOT NULL,
mails BIGINT NOT NULL,
PRIMARY KEY (host, day)
)
and update it either in a trigger on postfix_mails or with a script once in a while:
INSERT
INTO mails_host_day (host, day, mails)
SELECT host, CAST(queued_at AS DATE), COUNT(*)
FROM postfix_mails
WHERE id > :last_sync_id
GROUP BY
host, CAST(queued_at AS DATE)
ON DUPLICATE KEY
UPDATE mails = mails + VALUES(mails)
This way, querying a host-day entry is a single primary key seek.
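For instance, the original count for a given host and date would then become a primary-key lookup along these lines:
SELECT mails
FROM mails_host_day
WHERE host = 'mta03'
  AND day = '2016-04-11';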
Note that the trigger-based solution will affect DML performance, while the script-based solution will leave the summary slightly behind the live data.
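For the trigger-based variant, a minimal sketch could look like this (the trigger name is illustrative, and it assumes each inserted row represents one mail to count):
DELIMITER //
CREATE TRIGGER postfix_mails_count_ai
AFTER INSERT ON postfix_mails
FOR EACH ROW
BEGIN
    -- keep the per-host, per-day counter in step with every new row
    INSERT INTO mails_host_day (host, day, mails)
    VALUES (NEW.host, CAST(NEW.queued_at AS DATE), 1)
    ON DUPLICATE KEY UPDATE mails = mails + 1;
END//
DELIMITER ;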
However, you can improve the freshness of the script-based solution if you union the most recent, not-yet-aggregated data with the stored results:
SELECT host, day, SUM(mails) AS mails
FROM (
    SELECT host, day, mails
    FROM mails_host_day
    UNION ALL
    SELECT host, CAST(queued_at AS DATE) AS day, COUNT(*) AS mails
    FROM postfix_mails
    WHERE id >= :last_sync_id
    GROUP BY
        host, CAST(queued_at AS DATE)
) q
GROUP BY
    host, day
It's not a single index seek anymore; however, if you run the update script often enough, there will be fewer recent records left to read.
Upvotes: 1
Reputation: 6683
You could use pagination to speed up queries in PHP, which is usually how I resolve anything that contains a large amount of data - but this depends on your table hierarchy.
Integrate your LIMIT in the SQL query.
PHP:
// assuming $db is a PDO connection
$stmt = $db->prepare("SELECT COUNT(*) AS `count`
    FROM `postfix_mails`
    WHERE DATEDIFF(`queued_at`, '2016-04-11') = 0
      AND `mail_id` < :limit");
$stmt->execute(array(':limit' => $_POST['limit']));

foreach ($stmt as $row)
{
    // normal output
}
jQuery:
$(document).ready( function() {
    var starting = 1;
    $('#next').click( function() {
        starting = starting + 10;
        $.post('phpfilehere.php', { limit: starting })
            .done( function(data) {
                $('#mail-output').html(data);
            });
    });
});
Here, each page shows 10 more emails; of course you can change this and modify it, and even add a search (I actually have an object I use for this in all my projects).
I just thought I'd share the idea - it also adds real-time data flow to your site.
This was inspired by Facebook's scrolling "show more" - which really isn't hard to implement but is such a good way of querying a lot of data.
Upvotes: 0
Reputation: 1210
You have a unique key on 'host', 'mail_id', and 'to'; however, when the query engine tries to use that index, you aren't filtering on 'mail_id' and 'to', so it may not be as efficient. A solution could be to add another index just on 'host', or to add AND `mail_id` IS NOT NULL AND `to` IS NOT NULL to your query to fully utilize the existing unique index.
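For example, the two suggestions could be written like this (the index name is only illustrative):
ALTER TABLE `postfix_mails` ADD INDEX `idx_host` (`host`);

SELECT COUNT(*) as `count`
FROM `postfix_mails`
WHERE `queued_at` >= '2016-04-11'
  AND `queued_at` < '2016-04-12'
  AND `host` = 'mta03'
  AND `mail_id` IS NOT NULL
  AND `to` IS NOT NULL;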
Upvotes: 0