Three Queries Faster than One -- What's Wrong with my Joins?

Question

I've got a JPA ManyToMany relationship set up, which gives me three important tables: my Ticket table, my Join table, and my Inventory table. They're InnoDB tables on MySQL 5.1. The relevant bits are:

Ticket:
+--------+----------+------+-----+---------+----------------+
| Field  | Type     | Null | Key | Default | Extra          |
+--------+----------+------+-----+---------+----------------+
| ID     | int(11)  | NO   | PRI | NULL    | auto_increment |
| Status | longtext | YES  |     | NULL    |                |
+--------+----------+------+-----+---------+----------------+

JoinTable:
+-------------+---------+------+-----+---------+-------+
| Field       | Type    | Null | Key | Default | Extra |
+-------------+---------+------+-----+---------+-------+
| InventoryID | int(11) | NO   | PRI | NULL    |       | Foreign Key - Inventory
| TicketID    | int(11) | NO   | PRI | NULL    |       | Foreign Key - Ticket
+-------------+---------+------+-----+---------+-------+

Inventory:
+--------------+--------------+------+-----+---------+----------------+
| Field        | Type         | Null | Key | Default | Extra          |
+--------------+--------------+------+-----+---------+----------------+
| ID           | int(11)      | NO   | PRI | NULL    | auto_increment |
| TStampString | varchar(32)  | NO   | MUL | NULL    |                |
+--------------+--------------+------+-----+---------+----------------+

TStampStrings are of the form "yyyy.mm.dd HH:MM:SS Z" (for example, '2010.03.19 22:27:57 GMT'). Right now all of the Tickets created directly correspond to some specific hour TStampString, so that SELECT COUNT(*) FROM Ticket; is the same as SELECT COUNT(DISTINCT(SUBSTRING(TStampString, 1, 13))) FROM Inventory;

What I'd like to do is regroup certain Tickets based on the minute granularity of a TStampString: (SUBSTRING(TStampString, 1, 16)). So I'm profiling and testing the SELECT of an INSERT INTO ... SELECT statement:

EXPLAIN SELECT SUBSTRING(i.TStampString, 1, 16) FROM Ticket t JOIN JoinTable j
ON t.ID = j.TicketID JOIN Inventory i ON j.InventoryID = i.ID WHERE t.Status
= 'Regroup' GROUP BY SUBSTRING(i.TStampString, 1, 16);

+--+------+---+--------+-------------+-----+-----+----------+-------+-----------+
|id| type |tbl| type   | psbl_keys   | key | len | ref      | rows  | Extra     |
+--+------+---+--------+-------------+-----+-----+----------+-------+-----------+
|1 | SMPL | t | ALL    | PRI         | NULL| NULL| NULL     | 35569 | where     |
|  |      |   |        |             |     |     |          |       | +temporary|
|  |      |   |        |             |     |     |          |       | +filesort |
|1 | SMPL | j | ref    | PRI,FK1,FK2 | FK2 | 4   | t.ID     |   378 | index     |
|1 | SMPL | i | eq_ref | PRI         | PRI | 4   | j.Invent |     1 |           |
|  |      |   |        |             |     |     |    oryID |       |           |
+--+------+---+--------+-------------+-----+-----+----------+-------+-----------+

What this implies to me is that for each row in Ticket, MySQL first does the joins then later decides that the row is invalid due to the WHERE clause. Certainly the runtime is abominable (I gave up after 30 minutes). Note that it goes no faster with t.Status = 'Regroup' moved to the first JOIN clause and no WHERE clause.

But what's interesting is that if I run this query manually in three steps, doing what I thought the optimizer would do, each step returns almost immediately:

--Step 1: Select relevant Tickets (results dumped to file)
SELECT ID FROM Ticket WHERE Status = 'Regroup';

--Step 2: Get relevant Inventory entries
SELECT InventoryID FROM JoinTable WHERE TicketID IN (step 1s file);

--Step 3: Select what I wanted all along
SELECT SUBSTRING(TStampString, 1, 16) FROM Inventory WHERE ID IN (step 2s file)
GROUP BY SUBSTRING(TStampString, 1, 16);

On my particular tables, the first query gives 154 results, the second creates 206,598 lines, and the third query returns 9198 rows. All of them combined take ~2 minutes to run, with the last query having the only significant runtime.

Dumping the intermediate results to a file is cumbersome, and more importantly I'd like to know how to write my original query such that it runs reasonably. So how do I structure this three-table-join such that it runs as fast as I know is possible?

UPDATE: I've added a prefix index on Status(16), which changes my EXPLAIN profile rows to 153, 378, and 1 respectively (since the first row has a key to use). The JOIN version of my query now takes ~6 minutes, which is tolerable but still considerably slower than the manual version. I'd still like to know why the join is performing sorely suboptimally, but it may be that one can't create independent subqueries in buggy MySQL 5.1. If enough time passes I'll accept Add Index as the solution to my problem, although it's not exactly the answer to my question.

In the end I did end up manually recreating every step of the join on disk. Tens of thousands of files each with a thousand queries was still significantly faster than anything I could get my version of MySQL to do. But since that process would be horribly specific and unhelpful for the layman, I'm accepting ypercube's answer of Add (Partial) Indexes.

ypercubeᵀᴹ · Accepted Answer

What you can do to speed up the query:

Add an index on Status. Even if you don't change the type to VARCHAR, you can still add a partial index:
```
ALTER TABLE Ticket
  ADD INDEX status_idx
    Status(16) ;
```
I assume that the Primary key of the Join table is (InventoryID, TicketID). You can add another index on (TicketID, InventoryID) as well. This may not benefit this particular query but it will be helpful in other queries you'll have.

The answer on why this happens is that the optimizer does not always choose the best plan. You can try this variation of your query and see how the EXPLAIN plan differs and if there is any efficiency gain:

SELECT SUBSTRING(i.TStampString, 1, 16) 
FROM 
    ( SELECT (DISTINCT) j.InventoryID 
      FROM Ticket t 
        JOIN JoinTable j
          ON t.ID = j.TicketID 
      WHERE t.Status = 'Regroup' 
    ) AS tmp
  JOIN Inventory i 
    ON tmp.InventoryID = i.ID
GROUP BY SUBSTRING(i.TStampString, 1, 16) ;

Three Queries Faster than One -- What's Wrong with my Joins?

Answers (2)

Related Questions

Three Queries Faster than One -- What&#39;s Wrong with my Joins?

Answers (2)

Related Questions

Three Queries Faster than One -- What's Wrong with my Joins?