Reputation: 568
I have a database using the following schema:
CREATE TABLE IF NOT EXISTS `sessions` (
`starttime` datetime NOT NULL,
`ip` varchar(15) NOT NULL default '',
`country_name` varchar(45) default '',
`country_iso_code` varchar(2) default '',
`org` varchar(128) default '',
KEY (`ip`),
KEY (`starttime`),
KEY (`country_name`)
);
(The actual table contains more columns; I have included only the columns I query on.) The engine is InnoDB.
As you can see, there are 3 indices - on ip
, starttime
, and country_name
.
The table is very large - it contains something like 1.5 million rows. I am running various queries on it, trying to extract a month's worth of information (for the month of August, 2018, in the examples below).
A query like this
SELECT
UNIX_TIMESTAMP(starttime) as time_sec,
country_iso_code AS metric,
COUNT(country_iso_code) AS value
FROM
sessions
WHERE
starttime >= FROM_UNIXTIME(1533070800) AND
starttime <= FROM_UNIXTIME(1535749199)
GROUP BY metric;
is rather slow but bearable (tens of seconds), despite that there is no index on country_iso_code
.
(Ignore the first thing in the SELECT
; I know that it doesn't seem to make sense, but it is required in the tool that uses the result from the query. Similarly, ignore the use of FROM_UNIXTIME()
instead of a date string; this part of the query is generated automatically and I have no control over it.)
However, a query like this
SELECT
country_name AS Country,
COUNT(country_name) AS Attacks
FROM
sessions
WHERE
starttime >= FROM_UNIXTIME(1533070800) AND
starttime <= FROM_UNIXTIME(1535749199)
GROUP BY Country;
is unbearably slow - I let it run for about half an hour and gave up without getting any results.
Results from EXPLAIN
:
+----+-------------+----------+------------+-------+------------------------------------+--------------+---------+------+----------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+-------+------------------------------------+--------------+---------+------+----------+----------+-------------+
| 1 | SIMPLE | sessions | NULL | index | starttime,starttime_2,country_name | country_name | 138 | NULL | 14771687 | 35.81 | Using where |
+----+-------------+----------+------------+-------+------------------------------------+--------------+---------+------+----------+----------+-------------+
What exactly is the problem? Should I index on something else? Perhaps a composite index on (starttime
, country_name
)? I've read this guide but perhaps I've misunderstood it?
Here are some other queries that are similarly slow and probably suffering from the same problem:
Query #2:
SELECT
ip AS IP,
COUNT(ip) AS Attacks
FROM
sessions
WHERE
starttime >= FROM_UNIXTIME(1533070800) AND
starttime <= FROM_UNIXTIME(1535749199)
GROUP BY ip;
Results from EXPLAIN
:
+----+-------------+----------+------------+-------+--------------------------+------+---------+------+----------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+-------+--------------------------+------+---------+------+----------+----------+-------------+
| 1 | SIMPLE | sessions | NULL | index | starttime,ip,starttime_2 | ip | 47 | NULL | 14771780 | 35.81 | Using where |
+----+-------------+----------+------------+-------+--------------------------+------+---------+------+----------+----------+-------------+
Query #3:
SELECT
org AS Organization,
COUNT(org) AS Attacks
FROM
sessions
WHERE
starttime >= FROM_UNIXTIME(1533070800) AND
starttime <= FROM_UNIXTIME(1535749199)
GROUP BY Organization;
Results from EXPLAIN
:
+----+-------------+----------+------------+-------+---------------------------+------+---------+------+----------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+-------+---------------------------+------+---------+------+----------+----------+-------------+
| 1 | SIMPLE | sessions | NULL | index | starttime,starttime_2,org | org | 387 | NULL | 14771800 | 35.81 | Using where |
+----+-------------+----------+------------+-------+---------------------------+------+---------+------+----------+----------+-------------+
Query #4:
SELECT
ip AS IP,
country_name AS Country,
city_name AS City,
org AS Organization,
COUNT(ip) AS Attacks
FROM
sessions
WHERE
starttime >= FROM_UNIXTIME(1533070800) AND
starttime <= FROM_UNIXTIME(1535749199)
GROUP BY ip;
Results from EXPLAIN
:
+----+-------------+----------+------------+-------+--------------------------+------+---------+------+----------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+-------+--------------------------+------+---------+------+----------+----------+-------------+
| 1 | SIMPLE | sessions | NULL | index | starttime,ip,starttime_2 | ip | 47 | NULL | 14771914 | 35.81 | Using where |
+----+-------------+----------+------------+-------+--------------------------+------+---------+------+----------+----------+-------------+
Upvotes: 8
Views: 19666
Reputation: 541
Try to:
SELECT
DISTINCT(u1.user_id), u1.*
FROM
table_users u1
WHERE
u1.last_update_date = (SELECT MAX(u2.last_update_date)
FROM table_users u2
WHERE u2.user_id = u1.user_id);
Upvotes: 0
Reputation: 142298
Even better...
Note that you do not have a PRIMARY KEY
; that is naughty. Having a PK won't intrinsically improve performance, but having the PK start with starttime
will. Let's do this:
CREATE TABLE IF NOT EXISTS `sessions` (
id INT UNSIGNED NOT NULL AUTO_INCREMENT, -- note
`starttime` datetime NOT NULL,
`ip` varchar(39) NOT NULL CHARACTER SET ascii default '', -- note
`country_name` varchar(45) default '',
`country_iso_code` char(2) CHARACTER SET ascii default '', -- note
`org` varchar(128) default '',
PRIMARY KEY(starttime, id) -- in this order
INDEX(id) -- to keep AUTO_INCREMENT happy
-- The rest are unnecessary for the queries in question:
KEY (`ip`),
KEY (`starttime`),
KEY (`country_name`)
) ENGINE=InnoDB; -- just in case you are accidentally getting MyISAM
Why? This will take advantage of the "clustering" of the PK with the data. That way, only a section of the table that in in the time range will be scanned. And there won't be bouncing between the index and the data. And you won't need a lot of indexes to efficiently do all the cases.
IPv6 needs up to 39 bytes. Note that VARCHAR
will not let you do any range (CDR) tests. I can discuss that further you like.
Upvotes: 3
Reputation: 108651
In general, queries of the form
SELECT column, COUNT(column)
FROM tbl
WHERE datestamp >= a AND datestamp <= b
GROUP BY column
perform best when the table has a compound index on (datestamp, column)
. Why? They can be satisfied by an index scan rather than needing to read all the rows of the table.
In other words, the first relevant row for the query can be located by random-accessing the index (to the first value of the datestamp). Then, MySQL can read the index sequentially and count the various values in the column
until it hits the last relevant row. There's no need to read the actual table; the query is satisfied from the index alone. That makes it faster.
UPDATE TABLE tbl ADD INDEX date_col (datestamp, column);
creates the index for you.
Beware two things. One: single column indexes don't necessarily help aggregate query performance.
Two: It's hard to guess the right index to use to get an index scan without seeing the whole query. Simplified queries often lead to oversimplified indexes.
Upvotes: 13