Reputation: 1966
I have a table with more than a million entries and around 42 columns. I am trying to run SELECT query on this table which takes a minute to execute. In order to reduce the query execution time I added an index on the table, but the index is not being used.
The table structure is as follows. Though the table has 42 columns I am only showing here those that are relevant to my query
CREATE TABLE `tas_usage` (
`uid` int(11) NOT NULL AUTO_INCREMENT,
`userid` varchar(255) DEFAULT NULL,
`companyid` varchar(255) DEFAULT NULL,
`SERVICE` varchar(2000) DEFAULT NULL,
`runstatus` varchar(255) DEFAULT NULL,
`STATUS` varchar(2000) DEFAULT NULL,
`servertime` datetime DEFAULT NULL,
`machineId` varchar(2000) DEFAULT NULL,
PRIMARY KEY (`uid`)
) ENGINE=InnoDB AUTO_INCREMENT=2992891 DEFAULT CHARSET=latin1
The index that I have added is as follows
ALTER TABLE TAS_USAGE ADD INDEX last_quarter (SERVERTIME,COMPANYID(20),MACHINEID(20),SERVICE(50),RUNSTATUS(10));
My SELECT Query
EXPLAIN SELECT DISTINCT t1.COMPANYID, t1.USERID, t1.MACHINEID FROM TAS_USAGE t1
LEFT JOIN TAS_INVALID_COMPANY INVL ON INVL.COMPANYID = t1.COMPANYID
LEFT JOIN TAS_INVALID_MACHINE INVL_MAC_ID ON INVL_MAC_ID.MACHINEID = t1.MACHINEID
WHERE t1.SERVERTIME >= '2018-10-01 00:00:00' AND t1.SERVERTIME <= '2018-12-31 00:00:00' AND
INVL.companyId IS NULL AND INVL_MAC_ID.machineId IS NULL AND
t1.SERVICE NOT IN ('credentialtest%', 'webupdate%') AND
t1.RUNSTATUS NOT IN ('Failed', 'Failed Failed', 'Failed Success', 'Success Failed', '');
EXPLAIN result is as follows
+----+-------------+-------------+------------+--------+-----------------------+-----------------------+---------+-----------------------------+---------+----------+------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------+------------+--------+-----------------------+-----------------------+---------+-----------------------------+---------+----------+------------------------------------------------+
| 1 | SIMPLE | t1 | NULL | ALL | last_quarter | NULL | NULL | NULL | 1765296 | 15.68 | Using where; Using temporary |
| 1 | SIMPLE | INVL | NULL | ref | invalid_company_index | invalid_company_index | 502 | servicerunprod.t1.companyid | 1 | 100.00 | Using where; Not exists; Using index; Distinct |
| 1 | SIMPLE | INVL_MAC_ID | NULL | eq_ref | machineId | machineId | 502 | servicerunprod.t1.machineId | 1 | 100.00 | Using where; Not exists; Using index; Distinct |
+----+-------------+-------------+------------+--------+-----------------------+-----------------------+---------+-----------------------------+---------+----------+------------------------------------------------+
Explanation of my Query
I want to select all the records from table TAS_USAGE
COMPANYID
and MACHINEID
matching in
tables TAS_INVALID_COMPANY
and TAS_INVALID_MACHINE
ANDSERVICE
column and values ('Failed', 'Failed Failed', 'Failed
Success', 'Success Failed', '') in RUNSTATUS
columnUpvotes: 0
Views: 2594
Reputation: 142298
WHERE t1.SERVERTIME >= '2018-10-01 00:00:00'
AND t1.SERVERTIME <= '2018-12-31 00:00:00'
is strange. It covers 3 months minus 1 day plus 1 second. Suggest you rephrase thus:
WHERE t1.SERVERTIME >= '2018-10-01'
AND t1.SERVERTIME < '2018-10-01' + INTERVAL 3 MONTH
There are multiple possible reasons why the INDEX(servertime, ...)
was not used and/or was not "useful" even if used:
foo(10)
) is next to useless.What you can do:
SMALLINT UNSIGNED
with a max of 65K) will save a lot of space in this table. This, in turn, will speed up the query, and eliminate the desire for index prefixing.VARCHAR
. If you get it under 255, prefixing is no longer needed.NOT IN
is not optimizable. If you can invert the test and make it IN(...)
, more possibilities open up, such as INDEX(service, runstatus, servertime)
. If you have a new enough version of MySQL, I think the optimizer will hop around in the index on the two IN
columns and use the index for the time range. NOT IN ('credentialtest%', 'webupdate%')
-- Is %
part of the string? If you are using %
as a wildcard, that construct will not work. You would need two LIKE
clauses.Reformulate the query thus:
SELECT t1.COMPANYID, t1.USERID, t1.MACHINEID
FROM TAS_USAGE t1
WHERE t1.SERVERTIME >= '2018-10-01'
AND t1.SERVERTIME < '2018-10-01' + INTERVAL 3 MONTH
AND t1.SERVICE NOT IN ('credentialtest%', 'webupdate%')
AND t1.RUNSTATUS NOT IN ('Failed', 'Failed Failed',
'Failed Success', 'Success Failed', '')
AND NOT EXISTS( SELECT 1 FROM TAS_INVALID_COMPANY WHERE companyId = t1.COMPANYID )
AND NOT EXISTS( SELECT 1 FROM TAS_INVALID_MACHINE WHERE MACHINEID = t1.MACHINEID );
If the trio t1.COMPANYID, t1.USERID, t1.MACHINEID
is unique, then get rid of DISTINCT
.
Since there are only 6 (of 42) columns being used in this query, building a "covering" index will probably help:
INDEX(SERVERTIME, SERVICE, RUNSTATUS, COMPANYID, USERID, MACHINEID)
This is because the query can be performed entirely withing the index. In this case, I deliberately put the range first.
Upvotes: 1
Reputation: 11106
Focussing on the date range, MySQL basically has two options :
read the complete table consecutively and throw away records that do not fit the date range
use the index to identify the records in the date range and then look up each record in the table (using the primary key) individually ("random access")
Consecutive reads are significantly faster than random access, but you need to read more data. There will be some break-even point at which using an index will become slower than just simply reading everything, and MySQL assumes this is the case here. If that's the right choice will largely depend on how correctly it guessed how many records are actually in the range. If you make the range smaller, it should actually use the index at some point.
If you know that (or want to test if) using the index is faster, you can force MySQL to use it with
... FROM TAS_USAGE t1 force index (last_quarter) LEFT JOIN ...
You should test it with different ranges, and if you generate your query dynamically, only force the index when you are decently certain (as MySQL will not correct you if you e.g. specify a range that would include all rows).
There is one important way around the slow random access to the table, although it unfortunately does not work with your prefixed index, but I mention it in case you can reduce your field sizes (or change them to lookups/enums). You can include every column that MySQL needs to evaluate the query by using a covering index:
An index that includes all the columns retrieved by a query. Instead of using the index values as pointers to find the full table rows, the query returns values from the index structure, saving disk I/O.
As mentioned, since in a prefixed index, part of the data is missing, those columns unfortunately cannot be used to cover though.
Actually, they also cannot be used for much at all, especially not to filter records before doing the random access, as to evaluate your where
-condition for RUNSTATUS
or SERVICE
, the complete value is required anyway. So you could check if e.g. RUNSTATUS
is very significant - maybe 99% of your records are in status 'Failed' - and in that case add an unprefixed filter for just
(SERVERTIME, RUNSTATUS)
(and MySQL might even pick that index then on its own).
Upvotes: 1
Reputation: 34231
The distinct
clause is the one that interferes with the index usage. Since the index cannot be used to help with the distinct, mysql decided against the use of index completely.
If you rearrange the order of fields in the select list, in the index, and in the where clause, mysql may decide to use it:
ALTER TABLE TAS_USAGE ADD INDEX last_quarter (COMPANYID(20),MACHINEID(20), SERVERTIME, SERVICE(50),RUNSTATUS(10));
SELECT DISTINCT t1.COMPANYID, t1.MACHINEID, t1.USERID FROM TAS_USAGE t1
LEFT JOIN TAS_INVALID_COMPANY INVL ON INVL.COMPANYID = t1.COMPANYID
LEFT JOIN TAS_INVALID_MACHINE INVL_MAC_ID ON INVL_MAC_ID.MACHINEID = t1.MACHINEID
WHERE
INVL.companyId IS NULL AND INVL_MAC_ID.machineId IS NULL AND
t1.SERVERTIME >= '2018-10-01 00:00:00' AND t1.SERVERTIME <= '2018-12-31 00:00:00' AND
t1.SERVICE NOT IN ('credentialtest%', 'webupdate%') AND
t1.RUNSTATUS NOT IN ('Failed', 'Failed Failed', 'Failed Success', 'Success Failed', '');
This way COMPANYID, MACHINEID
fields become the leftmost fields in the distinct, where, and index - although the prefix may result in the index still to be discarded. You may want to consider reducing your varchar(255)
fields.
Upvotes: 0