Kirk Logan

Reputation: 783

Speeding up large MySQL query

The following query hangs in the "sending data" phase for an incredibly long time. It is a large query, but I'm hoping to get some assistance with my indexes and possibly learn a bit more about how MySQL actually chooses which index it's going to use.

Below is the query as well as a DESCRIBE statement output.

mysql> DESCRIBE SELECT e.employee_number, s.current_status_start_date, e.company_code, e.location_code, s.last_suffix_first_mi, s.job_title, SUBSTRING(e.job_code,1,1) tt_jobCode,
->                      SUM(e.current_amount) tt_grossWages,
->                      IFNULL((SUM(e.current_amount) - IF(tt1.tt_reduction = '','0',tt1.tt_reduction)),SUM(e.current_amount)) tt_taxableWages,
->                      t.new_code, STR_TO_DATE(s.last_hire_date, '%Y-%m-%d') tt_hireDate,
->                      IF(s.current_status_code = 'T',STR_TO_DATE(s.current_status_start_date, '%Y-%m-%d'),'') tt_terminationDate,
->                      IFNULL(tt_totalHours,'0') tt_totalHours
->               FROM check_earnings e
->               LEFT JOIN (
->                          SELECT * FROM summary
->                          GROUP BY employee_no
->                          ORDER BY current_status_start_date DESC
->                         ) s
->               ON e.employee_number = s.employee_no
->               LEFT JOIN (
->                          SELECT employee_no, SUM(current_amount__employee) tt_reduction
->                          FROM check_deductions
->                          WHERE STR_TO_DATE(pay_date, '%Y-%m-%d') >= STR_TO_DATE('2012-06-01', '%Y-%m-%d')
->                          AND STR_TO_DATE(pay_date, '%Y-%m-%d') <= STR_TO_DATE('2013-06-01', '%Y-%m-%d')
->                          AND (
->                               deduction_code IN ('DECMP','FSAM','FSAC','DCMAK','DCMAT','401KD')
->                               OR deduction_code LIKE 'IM%'
->                               OR deduction_code LIKE 'ID%'
->                               OR deduction_code LIKE 'IV%'
->                               )
->                          GROUP BY employee_no
->                          ORDER BY employee_no ASC
->                          ) tt1
->               ON e.employee_number = tt1.employee_no
->               LEFT JOIN translation t
->               ON e.location_code = t.old_code
->               LEFT JOIN (
->                          SELECT employee_number, SUM(current_hours) tt_totalHours
->                          FROM check_earnings
->                          WHERE STR_TO_DATE(pay_date, '%Y-%m-%d') >= STR_TO_DATE('2012-06-01', '%Y-%m-%d')
->                          AND STR_TO_DATE(pay_date, '%Y-%m-%d') <= STR_TO_DATE('2013-06-01', '%Y-%m-%d')
->                          AND earnings_code IN ('REG1','REG2','REG3','REG4')
->                          GROUP BY employee_number
->                         ) tt2
->               ON e.employee_number = tt2.employee_number
->               WHERE STR_TO_DATE(e.pay_date, '%Y-%m-%d') >= STR_TO_DATE('2012-06-01', '%Y-%m-%d')
->               AND STR_TO_DATE(e.pay_date, '%Y-%m-%d') <= STR_TO_DATE('2013-06-01', '%Y-%m-%d')
->               AND SUBSTRING(e.job_code,1,1) != 'E'
->               AND e.location_code != '639'
->               AND t.field = 'location_state'
->               GROUP BY e.employee_number
->               ORDER BY s.current_status_start_date DESC, e.location_code ASC, s.last_suffix_first_mi ASC;

+----+-------------+------------------+-------+----------------+-----------------+---------+----------------------------+---------+----------------------------------------------+
| id | select_type | table            | type  | possible_keys  | key             | key_len | ref                        | rows    | Extra                                        |
+----+-------------+------------------+-------+----------------+-----------------+---------+----------------------------+---------+----------------------------------------------+
|  1 | PRIMARY     | e                | ALL   | location_code  | NULL            | NULL    | NULL                       | 3498603 | Using where; Using temporary; Using filesort |
|  1 | PRIMARY     | t                | ref   | field,old_code | old_code        | 303     | historical.e.location_code |       1 | Using where                                  |
|  1 | PRIMARY     | <derived2>       | ALL   | NULL           | NULL            | NULL    | NULL                       |   16741 |                                              |
|  1 | PRIMARY     | <derived3>       | ALL   | NULL           | NULL            | NULL    | NULL                       |    2530 |                                              |
|  1 | PRIMARY     | <derived4>       | ALL   | NULL           | NULL            | NULL    | NULL                       |    2919 |                                              |
|  4 | DERIVED     | check_earnings   | index | NULL           | employee_number | 303     | NULL                       | 3498603 | Using where                                  |
|  3 | DERIVED     | check_deductions | index | deduction_code | employee_no     | 303     | NULL                       | 6387048 | Using where                                  |
|  2 | DERIVED     | summary          | index | NULL           | employee_no     | 303     | NULL                       |   17608 | Using temporary; Using filesort              |
+----+-------------+------------------+-------+----------------+-----------------+---------+----------------------------+---------+----------------------------------------------+
8 rows in set, 65535 warnings (32.77 sec)

EDIT: After playing with some indexes, it now spends the most time in the "Copying to tmp table" state.

Upvotes: 1

Views: 152

Answers (1)

Bill Karwin

Reputation: 562358

There's no way you can avoid the use of a temp table in that query. One reason is that you are grouping by different columns than you are sorting by.

Another reason is the use of derived tables (subqueries in the FROM/JOIN clauses).

One way you could speed this up is to create summary tables to store the result of those subqueries so you don't have to do them during every query.
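As a sketch of that idea, the per-employee deduction subquery could be materialized once into its own table and simply joined in the main query (table and column names are taken from the question; the column types here are assumptions, and you'd refresh this table on whatever schedule your data allows):

```sql
-- Materialize the per-employee deduction totals once, instead of
-- recomputing the derived table on every run of the big query.
CREATE TABLE deduction_summary (
    employee_no  VARCHAR(10)   NOT NULL PRIMARY KEY,  -- type assumed
    tt_reduction DECIMAL(12,2) NOT NULL               -- type assumed
);

INSERT INTO deduction_summary (employee_no, tt_reduction)
SELECT employee_no, SUM(current_amount__employee)
FROM check_deductions
WHERE pay_date >= '2012-06-01' AND pay_date <= '2013-06-01'
  AND (deduction_code IN ('DECMP','FSAM','FSAC','DCMAK','DCMAT','401KD')
       OR deduction_code LIKE 'IM%'
       OR deduction_code LIKE 'ID%'
       OR deduction_code LIKE 'IV%')
GROUP BY employee_no;
```

The main query would then `LEFT JOIN deduction_summary tt1 ON e.employee_number = tt1.employee_no`, and the PRIMARY KEY on `employee_no` gives that join an index instead of a full scan of a derived table.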

You are also forcing table-scans by searching on the result of functions like STR_TO_DATE() and SUBSTR(). These cannot be optimized with an index.
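For example, since the `pay_date` strings appear to already be in `'YYYY-MM-DD'` form, which sorts lexically in the same order as chronologically, the range predicate can compare the raw column to constants and become index-friendly (a sketch, assuming that format holds for every row):

```sql
-- Wrapping the column in a function defeats any index on it.
-- Instead of:
--   WHERE STR_TO_DATE(e.pay_date, '%Y-%m-%d') >= STR_TO_DATE('2012-06-01', '%Y-%m-%d')
--   AND   STR_TO_DATE(e.pay_date, '%Y-%m-%d') <= STR_TO_DATE('2013-06-01', '%Y-%m-%d')
-- compare the bare column, which an index on pay_date can satisfy:
WHERE e.pay_date >= '2012-06-01'
  AND e.pay_date <= '2013-06-01'
```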


Re your comment:

I can make a poorly-optimized query against a far smaller table run for 72 hours.

Note for example in the output of your DESCRIBE, it shows "ALL" for several of the tables involved in the join. This means it has to do a table-scan of all the rows (shown in the 'rows' column).

A rule of thumb: how many row comparisons does it take to resolve the join? Multiply the 'rows' values of all the tables joined together with the same 'id'.

+----+-------------+------------------+-------+---------+
| id | select_type | table            | type  | rows    |
+----+-------------+------------------+-------+---------+
|  1 | PRIMARY     | e                | ALL   | 3498603 |
|  1 | PRIMARY     | t                | ref   |       1 |
|  1 | PRIMARY     | <derived2>       | ALL   |   16741 |
|  1 | PRIMARY     | <derived3>       | ALL   |    2530 |
|  1 | PRIMARY     | <derived4>       | ALL   |    2919 |

So it may be evaluating the join conditions 432,544,383,105,752,610 times (assume those numbers are approximate, so it may not really be as bad as that). It's actually a miracle it takes only 5 hours!

What you need to do is use indexes to help the query reduce the number of rows it needs to examine.

For example, why are you using STR_TO_DATE() given that the date you are parsing is the native date format for MySQL? Why don't you store those columns as a DATE data type? Then the search could use an index.
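A sketch of that conversion (assuming every existing `pay_date` value really is a valid `'YYYY-MM-DD'` string; any rows that aren't will need cleaning first):

```sql
-- Store the column as a native DATE, then index it so the
-- range predicate in the WHERE clause can use the index.
ALTER TABLE check_earnings MODIFY pay_date DATE NOT NULL;
ALTER TABLE check_earnings ADD INDEX idx_pay_date (pay_date);
```

The same change would apply to `check_deductions.pay_date` and `summary.last_hire_date`, and it also removes the need for the `STR_TO_DATE()` calls in the SELECT list.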

You don't need to "play with indexes." Indexing is not a mystery, and it doesn't have random effects. See my presentation How to Design Indexes, Really for an introduction.

Upvotes: 1
