Florent Poujol

Reputation: 740

Optimize query with 1 join, on tables with 10+ million rows

I am looking at making a query that uses 2 tables faster.
I have the following 2 tables.

Table "logs"

That table has what the PHP Laravel Framework calls a "polymorphic many-to-many" relationship with several other objects, so there is a second table, "logs_pivot".

There are one or several entries in logs_pivot per entry in logs. The two tables have 20+ million and 10+ million rows, respectively.
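To give an idea of the shape of the data, here is a simplified sketch of the two tables (most columns are omitted and the exact types are only indicative; the UUID-style string keys are suggested by the key lengths in the EXPLAIN output below):

CREATE TABLE logs (
    id      VARCHAR(36) NOT NULL PRIMARY KEY,   -- UUID-style key (type assumed)
    date    DATETIME    NOT NULL,
    message TEXT                                -- column name assumed
    -- ... other columns omitted
) ENGINE=InnoDB;

CREATE TABLE logs_pivot (
    id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- type assumed
    log_id     VARCHAR(36)  NOT NULL,   -- references logs.id
    model_id   VARCHAR(36)  NOT NULL,   -- id of the related object (polymorphic)
    model_type VARCHAR(191) NOT NULL    -- class of the related object, e.g. 'My\Class'
) ENGINE=InnoDB;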

We do queries like so:

select * from logs 
join logs_pivot on logs.id = logs_pivot.log_id
where model_id = 'some_id' and model_type = 'My\Class'
order by date desc
limit 50;

Obviously we have a compound index on both the model_id and model_type fields, but the queries are still slow: several (dozens of) seconds every time.
We also have an index on the date field, but an EXPLAIN shows that it is the model_id_model_type index that is used.
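Concretely, the existing indexes are along these lines (the two pivot index names are the ones visible in the EXPLAIN below; the name of the date index is illustrative):

CREATE INDEX logs_pivot_model_id_model_type_index ON logs_pivot (model_id, model_type);
CREATE INDEX logs_pivot_log_id_index ON logs_pivot (log_id);
CREATE INDEX logs_date_index ON logs (date);   -- index name assumed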

EXPLAIN output:

+----+-------------+-------------+------------+--------+--------------------------------------------------------------------------------+-----------------------------------------------+---------+-------------------------------------------+------+----------+---------------------------------+
| id | select_type | table       | partitions | type   | possible_keys                                                                  | key                                           | key_len | ref                                       | rows | filtered | Extra                           |
+----+-------------+-------------+------------+--------+--------------------------------------------------------------------------------+-----------------------------------------------+---------+-------------------------------------------+------+----------+---------------------------------+
|  1 | SIMPLE      | logs_pivot  | NULL       | ref    | logs_pivot_model_id_model_type_index,logs_pivot_log_id_index | logs_pivot_model_id_model_type_index | 364     | const,const                               |    1 |   100.00 | Using temporary; Using filesort |
|  1 | SIMPLE      | logs        | NULL       | eq_ref | PRIMARY                                                                        | PRIMARY                                       | 146     | the_db_name.logs_pivot.log_id |    1 |   100.00 | NULL                            |
+----+-------------+-------------+------------+--------+--------------------------------------------------------------------------------+-----------------------------------------------+---------+-------------------------------------------+------+----------+---------------------------------+

For other tables, I was able to make a similar query much faster by including the date field in the index. But in this case, the date field and the filter columns are in separate tables.

When we want to access this data, it is typically a few hours or days old.
Our InnoDB buffer pool is much too small to hold all that data (plus all the other tables) in memory, so the data is most probably always read from disk.

What would be all the ways we could make that query faster?
Ideally only with another index, or by changing how the query is done.

Thanks a lot!


Edit 17h05:
Thank you all for your answers so far. I will try something like O. Jones suggests, and also try to somehow include the date field in the pivot table, so that I can include it in the index.


Edit 14/10 10h.

Solution:

So I ended up changing how the query is actually done: by sorting on the id field of the pivot table, which indeed allows putting it in an index.

Also, the query that counts the total number of rows was changed to be done only on the pivot table, when it is not filtered by date.
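In essence, the change looks something like this (a simplified sketch; the index name and exact column layout are illustrative):

-- Index covering the filter columns plus the pivot's own id,
-- so the ORDER BY ... LIMIT can be resolved from the index alone
CREATE INDEX logs_pivot_model_id_model_type_id_index
    ON logs_pivot (model_id, model_type, id);

SELECT logs.*
  FROM logs
  JOIN logs_pivot ON logs.id = logs_pivot.log_id
 WHERE logs_pivot.model_id = 'some_id'
   AND logs_pivot.model_type = 'My\Class'
 ORDER BY logs_pivot.id DESC   -- sort on the pivot id instead of logs.date
 LIMIT 50;

-- When no date filter is applied, the total count only touches the pivot table
SELECT COUNT(*)
  FROM logs_pivot
 WHERE model_id = 'some_id'
   AND model_type = 'My\Class';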

Thank you all!

Upvotes: 0

Views: 807

Answers (3)

O. Jones

Reputation: 108706

SELECT columns FROM big table ORDER BY something LIMIT small number is a notorious query-performance antipattern. Why? The server sorts a whole mess of long rows and then discards almost all of them. It doesn't help that one of your columns is a LOB -- a TEXT column.

Here's an approach that can reduce that overhead: Figure out which rows you want by finding the set of primary keys you want, then fetch the content of only those rows.

What rows do you want? This subquery finds them.

                  SELECT logs.id
                    FROM logs
                    JOIN logs_pivot 
                            ON logs.id = logs_pivot.log_id
                   WHERE logs_pivot.model_id = 'some_id'
                     AND logs_pivot.model_type = 'My\Class'
                   ORDER BY logs.date DESC
                   LIMIT 50

This does all the heavy lifting of working out the rows you want. So, this is the query you need to optimize.

It can be accelerated by this index on logs

CREATE INDEX logs_date_desc ON logs (date DESC);

and this three-column compound index on logs_pivot

CREATE INDEX logs_pivot_lookup ON logs_pivot (model_id, model_type, log_id);

This compound index on logs_pivot is likely to be the more useful of the two, since the optimizer will see the filtering on logs_pivot but not on logs. Hence, it will look in logs_pivot first.

Or maybe

CREATE INDEX logs_pivot_lookup ON logs_pivot (log_id, model_id, model_type);

Try one then the other to see which yields faster results. (I'm not sure how the JOIN will use the compound index.) (Or simply add both, and use EXPLAIN to see which one it uses.)
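For instance, run the subquery by itself under EXPLAIN and compare:

EXPLAIN
SELECT logs.id
  FROM logs
  JOIN logs_pivot ON logs.id = logs_pivot.log_id
 WHERE logs_pivot.model_id = 'some_id'
   AND logs_pivot.model_type = 'My\Class'
 ORDER BY logs.date DESC
 LIMIT 50;

The key column of the output shows which index the optimizer picked, and Extra shows whether it still needs "Using filesort".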

Then, when you're happy -- or satisfied anyway -- with the subquery's performance, use it to grab the rows you need, like this

SELECT * 
  FROM logs
  WHERE id IN (
                  SELECT logs.id
                    FROM logs
                    JOIN logs_pivot 
                            ON logs.id = logs_pivot.log_id
                   WHERE logs_pivot.model_id = 'some_id'
                     AND logs_pivot.model_type = 'My\Class'
                   ORDER BY logs.date DESC
                   LIMIT 50
              )
  ORDER BY date DESC

This works because it sorts less data. The covering three-column index on logs_pivot will also help.

Notice that both the subquery and the main query have ORDER BY clauses, to make sure the returned detail result set is in the order you need.

Edit: Darnit, I've been on MariaDB 10+ and MySQL 8+ so long that I forgot about the old limitation on using LIMIT inside an IN (...) subquery. Try this instead.

SELECT * 
  FROM logs
  JOIN (
                  SELECT logs.id
                    FROM logs
                    JOIN logs_pivot 
                            ON logs.id = logs_pivot.log_id
                   WHERE logs_pivot.model_id = 'some_id'
                     AND logs_pivot.model_type = 'My\Class'
                   ORDER BY logs.date DESC
                   LIMIT 50
        ) id_set ON logs.id = id_set.id
  ORDER BY date DESC

Finally, if you know you only care about rows newer than a certain time, you can add something like this to your subquery.

                  AND logs.date >= NOW() - INTERVAL 5 DAY

This will help a lot if you have tonnage of historical data in your table.
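Putting the pieces together, the derived-table form with the recency guard would look something like this (the 5-day window is only an example):

SELECT logs.*
  FROM logs
  JOIN (
         SELECT logs.id
           FROM logs
           JOIN logs_pivot ON logs.id = logs_pivot.log_id
          WHERE logs_pivot.model_id = 'some_id'
            AND logs_pivot.model_type = 'My\Class'
            AND logs.date >= NOW() - INTERVAL 5 DAY   -- optional recency guard
          ORDER BY logs.date DESC
          LIMIT 50
       ) id_set ON logs.id = id_set.id
 ORDER BY logs.date DESC;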

Upvotes: 1

DRapp

Reputation: 48139

Just a suggestion. Using a compound index is obviously a good thing. Another option might be to pre-qualify an ID by date, and to extend the index on your logs_pivot table to (model_id, model_type, log_id).

If you are querying the data and the entire history is 20+ million records, how far back does the data go, given that you only want a limit of 50 records per given category of model id/type? Say 3 months, versus a log spanning, say, 5 years? (Not stated in the post, just a for-instance.) So if you can query the minimum log ID whose date is no older than, say, 3 months back, that one ID can limit what else is going on in your logs_pivot table.

Something like

select
      lp.*,
      l.date
   from
      logs_pivot lp
         JOIN logs l
            on lp.log_id = l.id
   where
          model_id = 'some_id' 
      and model_type = 'My\Class'
      and log_id >= ( select min( id )
                         from logs
                        where date >= date_sub( curdate(), interval 3 month ))
   order by 
      l.date desc
   limit  
      50;

So, the WHERE clause for log_id is evaluated once and returns just one ID from as far back as 3 months, not the entire history of logs_pivot. Then you query with the optimized two-part key on model id/type, but you also jump past the historical part of that index because the ID is included in the index key.

Another thing you MAY want to consider is pre-aggregate tables of record counts, e.g. per month/year per given model type/id. Use that as a pre-query to present to users, and then as a drill-down to get more detail. The pre-aggregation can be done once for all the historical data, since it is static and does not change. The only part you would have to update constantly is the current single month, for example on a nightly basis. Possibly even better, use a trigger that either inserts a record every time an add is done, or updates a count for the given model/type per year/month. Again, just a suggestion, as there is no other context on how/why the data will be presented to the end user.
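As a rough sketch of the idea (the table name, column types, and refresh query are all illustrative):

CREATE TABLE logs_counts_monthly (
    model_id     VARCHAR(36)  NOT NULL,   -- types assumed; should match logs_pivot
    model_type   VARCHAR(191) NOT NULL,
    period_month CHAR(7)      NOT NULL,   -- e.g. '2021-10'
    row_count    INT UNSIGNED NOT NULL DEFAULT 0,
    PRIMARY KEY (model_id, model_type, period_month)
) ENGINE=InnoDB;

-- Nightly refresh of the current month only; older months never change
INSERT INTO logs_counts_monthly (model_id, model_type, period_month, row_count)
SELECT lp.model_id,
       lp.model_type,
       DATE_FORMAT(l.date, '%Y-%m') AS period_month,
       COUNT(*)
  FROM logs_pivot lp
  JOIN logs l ON l.id = lp.log_id
 WHERE l.date >= DATE_FORMAT(CURDATE(), '%Y-%m-01')
 GROUP BY lp.model_id, lp.model_type, period_month
ON DUPLICATE KEY UPDATE row_count = VALUES(row_count);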

Upvotes: 1

Rick James

Reputation: 142316

I see two problems:

  • UUIDs are costly when tables are huge relative to RAM size.

  • The LIMIT cannot be handled optimally because the WHERE clauses come from one table, but the ORDER BY column comes from another table. That is, it will do all of the JOIN, then sort and finally peel off a few rows.
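A possible remedy for the second point is to denormalize the date into logs_pivot, so that the WHERE and the ORDER BY read from the same table and one index can cover both. A rough sketch (the log_date column and the index name are assumptions):

ALTER TABLE logs_pivot
    ADD COLUMN log_date DATETIME NOT NULL;   -- denormalized copy of logs.date (name assumed)

CREATE INDEX logs_pivot_model_date_index
    ON logs_pivot (model_id, model_type, log_date);

SELECT logs.*
  FROM logs
  JOIN logs_pivot ON logs.id = logs_pivot.log_id
 WHERE logs_pivot.model_id = 'some_id'
   AND logs_pivot.model_type = 'My\Class'
 ORDER BY logs_pivot.log_date DESC   -- filter and sort now use the same index
 LIMIT 50;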

Upvotes: 1
