Dandy
Dandy

Reputation: 21

Help with Advanced MySQL Optimization

I've run into a problem with an SQL query, that is "failing" (taking too long) when the tables have more than 100k records. This should not be a problem, and I thought I had it covered as it works pretty fine for 50k records.

I'll try to be brief and clear, so I'll start with the query:

SELECT
    V.id
FROM
    videos V
    LEFT JOIN videos_categories VC ON V.id = VC.video_id
    LEFT JOIN categories C ON VC.category_id = C.id
    LEFT JOIN users U ON V.user_id = U.id -- irrelevant table. Don't pay attention
WHERE
    V.status = 1
    AND (C.status = 1 OR C.id IS NULL)
    AND (U.status = 1 OR U.id IS NULL) -- irrelevant
GROUP BY V.id
ORDER BY V.id DESC
LIMIT 0, 12

---------------------------------------------
**Query took 10.8771 sec** (very bad! this would take 0.1 max)

I'm using all LEFT JOINs because I don't want to restrain results if a category doesn't exist. This means that videos without assigned categories are also returned.

The idea of the tables structure is the next:

---- UPDATE July 3 ----

Structure for the tables:

CREATE TABLE `videos` ( -- Holding +100k records
    `id` int(10) unsigned NOT NULL auto_increment,
    `user_id` int(10) unsigned NOT NULL default '0', -- irrelevant for this example
    `status` tinyint(1) NOT NULL default '0',
    PRIMARY KEY  (`id`),
    KEY `status` (`status`)
    -- ... -- Irrelevant Keys
) ENGINE=MyISAM  DEFAULT CHARSET=utf8 ROW_FORMAT=DYNAMIC AUTO_INCREMENT=113339 ;


CREATE TABLE `videos_categories` (  -- Holding +600k records (several categories per video)
    `video_id` int(10) unsigned NOT NULL default '0',
    `category_id` int(10) unsigned NOT NULL default '0',
    KEY `video_id` (`video_id`),
    KEY `category_id` (`category_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

Categories table has a PK id, and irrelevant fields. It holds 80 records. Users table is completely irrelevant and may be ignored. Sorry for adding it in the first instance.

---- END OF UPDATE July 3 ----

This is the EXPLAINed result for the query

id  select_type     table   type    possible_keys       key         key_len     ref             rows    Extra
1   SIMPLE          V       range   status              status      1           NULL            112895  Using where; Using temporary; Using filesort
1   SIMPLE          VC      ref     video_id            video_id    4           V.id            2    
1   SIMPLE          C       eq_ref  PRIMARY             PRIMARY     4           VC.category_id  1       Using where
1   SIMPLE          U       eq_ref  PRIMARY             PRIMARY     4           V.user_id       1       Using where

I think the problem is that the SQL engine is "Using filesort" because it's using the 'status' index, instead of V.id. Also, it's "Using temporary" because of the number of records the engine has to write and in-memory table is not enough.

UPDATE (July 3): After some tests I reached to the conclusion that the problem of this particular query is that the usage of V.status as index doesn't help at all (98% of the videos has status=1)

Important Note: If I remove the 'V.status=1' filter from the WHERE clause, the query takes 0.01 s, and it uses V.id (PRIMARY) as index, solving it all.

---- End of Update note July 3 ----

TO SUM IT UP

Assuming I have all relevant indexes covered: How can I optimize the query, for it to take 0.1 seconds?

I'm pretty sure this is quite a good challenge for advanced SQL Administrators and programmers.

Upvotes: 2

Views: 150

Answers (2)

Jonathan Leffler
Jonathan Leffler

Reputation: 753585

Given your query (somewhat reformatted):

SELECT V.id
  FROM videos V
  LEFT JOIN videos_categories VC ON V.id = VC.video_id
  LEFT JOIN categories C ON VC.category_id = C.id
  LEFT JOIN users U ON V.user_id = U.id
 WHERE V.status = 1
   AND V.reported < 10
   AND (C.status = 1 OR C.id IS NULL)
   AND (U.status = 1 OR U.id IS NULL)
 GROUP BY V.id
 ORDER BY V.id DESC
 LIMIT 0, 12

You have incorrectly characterized your tables. You said:

  • 'videos' (id PK, +irrelevant fields) table holds +100k records.
  • 'videos_categories' (video_id INDEX,category_id INDEX) +600k records - multiple rows per video
  • 'categories' (id PK, + irrelevant fields)
  • 'users' (id PK, +irrelevant fields) Not a problem.

The cardinalities (row counts) of Categories and Users would be informative. More seriously, though, the query references:

  • videos.status
  • videos.reported
  • videos.user_id
  • categories.status
  • users.status

These fields should be being mentioned separately from the irrelevant fields, and any indexes on these columns should be identified. It would be better to provide table schemas which could be used to answer the query, with a comment '-- and other irrelevant columns' at the end of each table.

Does the Video_Categories table have a unique constraint on the combined (Video_ID, Category_ID) columns? Why not?

It isn't immediately clear why the Videos table has a User_ID column; it looks more like there should be a Video_Users table with (Video_ID, User_ID) columns. However, that's a separate discussion. Also, it is not clear why you would have videos without a user ID value, so the left outer join to users is puzzling too. However, you bravely assert that this is not part of the problem, so we take you at your word.

LEFT OUTER JOIN can be a serious performance inhibitor. You might get better results from a UNION (or you might not - UNION can also be a performance inhibitor!):

SELECT V.ID
  FROM (SELECT V.ID, V.User_ID, V.Status, V.Reported
          FROM videos AS V
          JOIN videos_categories AS VC ON V.id = VC.Video_ID
          JOIN categories AS C ON VC.category_id = C.ID
         WHERE C.Status = 1
        UNION
        SELECT V.ID, V.User_ID, V.Status, V.Reported
          FROM videos V
         WHERE V.ID NOT IN (SELECT Video_ID FROM Video_Categories)
       ) AS L
  LEFT JOIN Users AS U ON L.User_ID = U.ID
 WHERE L.Reported < 10
   AND (U.status = 1 OR U.ID IS NULL)
 GROUP BY L.ID
 ORDER BY L.ID DESC
 LIMIT 0, 12

(The alias 'L' is for 'List of Videos'.) The thinking here is that the first half of the UNION deals with inner joins, and the second half deals with the videos that aren't categorized. However, the NOT IN condition is likely to be a performance problem, if there is one. Come to think of it, I think the two lists of videos in the UNION should be disjoint, so you can use UNION ALL in place of UNION; that can be beneficial to performance (because it avoids a duplicate elimination phase).

You could probably usefully push the 'L.Reported < 10' condition down into each half of the UNION (where it becomes V.Reported < 10) if the optimizer does not do that automatically for you.

I'm by no means confident that this will perform better than the original, but it at least gives you some ideas to mull over.

Upvotes: 2

James
James

Reputation: 11

Jonathan raises some interesting and worthwhile points. In addition, if this is approached as an optimizer or index problem rather than a query problem, it might be worthwhile to ask how selectivity looks on the V.status column. (See here for more info on selectivity if needed.) If selectivity is poor, then:

  • It would likely be more efficient to do the join of V and VC tables and then filter out the rows that don't match the restriction on status

  • Indexing on V.status is probably not useful.

Some other things that might be useful to check are:

  • Update statistics on the V table (ANALYZE TABLE) in case poor stats are misleading the optimizer about selectivity on the status index

    • If this is on <5.5, check whether EXPLAIN shows the same plan on 5.5. Significant improvements were made to the optimizer on later versions.

    • Is Videos an InnoDB table? If so, the index on status is really on (PK, status) since the clustered index (the PK if there is one) is included in the nonclustered index on status. If it is MyISAM, you could test converting the table to see if that affects the plan.

As somewhat of an aside, I'd like point out as politely as possible that I think you might have a minor misconception about about what 'using temporary' and filesort mean. Baron Schwartz talks about that in a post here.

Upvotes: 1

Related Questions