bontchev
bontchev

Reputation: 568

MySQL query using GROUP BY is extremely slow

I have a database using the following schema:

CREATE TABLE IF NOT EXISTS `sessions` (
  `starttime` datetime NOT NULL,
  `ip` varchar(15) NOT NULL default '',
  `country_name` varchar(45) default '',
  `country_iso_code` varchar(2) default '',
  `org` varchar(128) default '',
  KEY (`ip`),
  KEY (`starttime`),
  KEY (`country_name`)
);

(The actual table contains more columns; I have included only the columns I query on.) The engine is InnoDB.

As you can see, there are 3 indices - on ip, starttime, and country_name.

The table is very large - it contains something like 1.5 million rows. I am running various queries on it, trying to extract a month's worth of information (for the month of August, 2018, in the examples below).

A query like this

SELECT
  UNIX_TIMESTAMP(starttime) as time_sec,
  country_iso_code AS metric,
  COUNT(country_iso_code) AS value
FROM
  sessions
WHERE
  starttime >= FROM_UNIXTIME(1533070800) AND
  starttime <= FROM_UNIXTIME(1535749199)
GROUP BY metric;

is rather slow but bearable (tens of seconds), despite that there is no index on country_iso_code.

(Ignore the first thing in the SELECT; I know that it doesn't seem to make sense, but it is required in the tool that uses the result from the query. Similarly, ignore the use of FROM_UNIXTIME() instead of a date string; this part of the query is generated automatically and I have no control over it.)

However, a query like this

SELECT
  country_name AS Country,
  COUNT(country_name) AS Attacks
FROM
  sessions
WHERE
  starttime >= FROM_UNIXTIME(1533070800) AND
  starttime <= FROM_UNIXTIME(1535749199)
GROUP BY Country;

is unbearably slow - I let it run for about half an hour and gave up without getting any results.

Results from EXPLAIN:

+----+-------------+----------+------------+-------+------------------------------------+--------------+---------+------+----------+----------+-------------+
| id | select_type | table    | partitions | type  | possible_keys                      | key          | key_len | ref  | rows     | filtered | Extra       |
+----+-------------+----------+------------+-------+------------------------------------+--------------+---------+------+----------+----------+-------------+
|  1 | SIMPLE      | sessions | NULL       | index | starttime,starttime_2,country_name | country_name | 138     | NULL | 14771687 |    35.81 | Using where |
+----+-------------+----------+------------+-------+------------------------------------+--------------+---------+------+----------+----------+-------------+

What exactly is the problem? Should I index on something else? Perhaps a composite index on (starttime, country_name)? I've read this guide but perhaps I've misunderstood it?

Here are some other queries that are similarly slow and probably suffering from the same problem:

Query #2:

SELECT
  ip AS IP,
  COUNT(ip) AS Attacks
FROM
  sessions
WHERE
  starttime >= FROM_UNIXTIME(1533070800) AND
  starttime <= FROM_UNIXTIME(1535749199)
GROUP BY ip;

Results from EXPLAIN:

+----+-------------+----------+------------+-------+--------------------------+------+---------+------+----------+----------+-------------+
| id | select_type | table    | partitions | type  | possible_keys            | key  | key_len | ref  | rows     | filtered | Extra       |
+----+-------------+----------+------------+-------+--------------------------+------+---------+------+----------+----------+-------------+
|  1 | SIMPLE      | sessions | NULL       | index | starttime,ip,starttime_2 | ip   | 47      | NULL | 14771780 |    35.81 | Using where |
+----+-------------+----------+------------+-------+--------------------------+------+---------+------+----------+----------+-------------+

Query #3:

SELECT
  org AS Organization,
  COUNT(org) AS Attacks
FROM
  sessions
WHERE
  starttime >= FROM_UNIXTIME(1533070800) AND
  starttime <= FROM_UNIXTIME(1535749199)
GROUP BY Organization;

Results from EXPLAIN:

+----+-------------+----------+------------+-------+---------------------------+------+---------+------+----------+----------+-------------+
| id | select_type | table    | partitions | type  | possible_keys             | key  | key_len | ref  | rows     | filtered | Extra       |
+----+-------------+----------+------------+-------+---------------------------+------+---------+------+----------+----------+-------------+
|  1 | SIMPLE      | sessions | NULL       | index | starttime,starttime_2,org | org  | 387     | NULL | 14771800 |    35.81 | Using where |
+----+-------------+----------+------------+-------+---------------------------+------+---------+------+----------+----------+-------------+

Query #4:

SELECT
  ip AS IP,
  country_name AS Country,
  city_name AS City,
  org AS Organization,
  COUNT(ip) AS Attacks
FROM
  sessions
WHERE
  starttime >= FROM_UNIXTIME(1533070800) AND
  starttime <= FROM_UNIXTIME(1535749199)
GROUP BY ip;

Results from EXPLAIN:

+----+-------------+----------+------------+-------+--------------------------+------+---------+------+----------+----------+-------------+
| id | select_type | table    | partitions | type  | possible_keys            | key  | key_len | ref  | rows     | filtered | Extra       |
+----+-------------+----------+------------+-------+--------------------------+------+---------+------+----------+----------+-------------+
|  1 | SIMPLE      | sessions | NULL       | index | starttime,ip,starttime_2 | ip   | 47      | NULL | 14771914 |    35.81 | Using where |
+----+-------------+----------+------------+-------+--------------------------+------+---------+------+----------+----------+-------------+

Upvotes: 8

Views: 19666

Answers (3)

Bang Andre
Bang Andre

Reputation: 541

Try to:

SELECT 
    DISTINCT(u1.user_id), u1.*
FROM
    table_users u1
WHERE
    u1.last_update_date = (SELECT MAX(u2.last_update_date)
        FROM table_users u2
        WHERE u2.user_id = u1.user_id);

Upvotes: 0

Rick James
Rick James

Reputation: 142298

Even better...

Note that you do not have a PRIMARY KEY; that is naughty. Having a PK won't intrinsically improve performance, but having the PK start with starttime will. Let's do this:

CREATE TABLE IF NOT EXISTS `sessions` (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,   -- note
  `starttime` datetime NOT NULL,
  `ip` varchar(39) NOT NULL CHARACTER SET ascii default '',  -- note
  `country_name` varchar(45) default '',
  `country_iso_code` char(2) CHARACTER SET ascii  default '',  -- note
  `org` varchar(128) default '',
  PRIMARY KEY(starttime, id)  -- in this order
  INDEX(id)                   -- to keep AUTO_INCREMENT happy
  -- The rest are unnecessary for the queries in question:
  KEY (`ip`),
  KEY (`starttime`),
  KEY (`country_name`)
) ENGINE=InnoDB;        -- just in case you are accidentally getting MyISAM

Why? This will take advantage of the "clustering" of the PK with the data. That way, only a section of the table that in in the time range will be scanned. And there won't be bouncing between the index and the data. And you won't need a lot of indexes to efficiently do all the cases.

IPv6 needs up to 39 bytes. Note that VARCHAR will not let you do any range (CDR) tests. I can discuss that further you like.

Upvotes: 3

O. Jones
O. Jones

Reputation: 108651

In general, queries of the form

  SELECT column, COUNT(column)
    FROM tbl
   WHERE datestamp >= a AND datestamp <= b
   GROUP BY column

perform best when the table has a compound index on (datestamp, column). Why? They can be satisfied by an index scan rather than needing to read all the rows of the table.

In other words, the first relevant row for the query can be located by random-accessing the index (to the first value of the datestamp). Then, MySQL can read the index sequentially and count the various values in the column until it hits the last relevant row. There's no need to read the actual table; the query is satisfied from the index alone. That makes it faster.

UPDATE TABLE tbl ADD INDEX date_col (datestamp, column);

creates the index for you.

Beware two things. One: single column indexes don't necessarily help aggregate query performance.

Two: It's hard to guess the right index to use to get an index scan without seeing the whole query. Simplified queries often lead to oversimplified indexes.

Upvotes: 13

Related Questions