Reputation: 21
I've run into a problem with an SQL query, that is "failing" (taking too long) when the tables have more than 100k records. This should not be a problem, and I thought I had it covered as it works pretty fine for 50k records.
I'll try to be brief and clear, so I'll start with the query:
SELECT
V.id
FROM
videos V
LEFT JOIN videos_categories VC ON V.id = VC.video_id
LEFT JOIN categories C ON VC.category_id = C.id
LEFT JOIN users U ON V.user_id = U.id -- irrelevant table. Don't pay attention
WHERE
V.status = 1
AND (C.status = 1 OR C.id IS NULL)
AND (U.status = 1 OR U.id IS NULL) -- irrelevant
GROUP BY V.id
ORDER BY V.id DESC
LIMIT 0, 12
---------------------------------------------
**Query took 10.8771 sec** (very bad! this would take 0.1 max)
I'm using all LEFT JOINs because I don't want to restrain results if a category doesn't exist. This means that videos without assigned categories are also returned.
The idea of the tables structure is the next:
---- UPDATE July 3 ----
Structure for the tables:
CREATE TABLE `videos` ( -- Holding +100k records
`id` int(10) unsigned NOT NULL auto_increment,
`user_id` int(10) unsigned NOT NULL default '0', -- irrelevant for this example
`status` tinyint(1) NOT NULL default '0',
PRIMARY KEY (`id`),
KEY `status` (`status`)
-- ... -- Irrelevant Keys
) ENGINE=MyISAM DEFAULT CHARSET=utf8 ROW_FORMAT=DYNAMIC AUTO_INCREMENT=113339 ;
CREATE TABLE `videos_categories` ( -- Holding +600k records (several categories per video)
`video_id` int(10) unsigned NOT NULL default '0',
`category_id` int(10) unsigned NOT NULL default '0',
KEY `video_id` (`video_id`),
KEY `category_id` (`category_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Categories table has a PK id, and irrelevant fields. It holds 80 records. Users table is completely irrelevant and may be ignored. Sorry for adding it in the first instance.
---- END OF UPDATE July 3 ----
This is the EXPLAINed result for the query
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE V range status status 1 NULL 112895 Using where; Using temporary; Using filesort
1 SIMPLE VC ref video_id video_id 4 V.id 2
1 SIMPLE C eq_ref PRIMARY PRIMARY 4 VC.category_id 1 Using where
1 SIMPLE U eq_ref PRIMARY PRIMARY 4 V.user_id 1 Using where
I think the problem is that the SQL engine is "Using filesort" because it's using the 'status' index, instead of V.id.
Also, it's "Using temporary" because of the number of records the engine has to write and in-memory table is not enough.
UPDATE (July 3): After some tests I reached to the conclusion that the problem of this particular query is that the usage of V.status as index doesn't help at all (98% of the videos has status=1)
Important Note: If I remove the 'V.status=1' filter from the WHERE clause, the query takes 0.01 s, and it uses V.id (PRIMARY) as index, solving it all.
---- End of Update note July 3 ----
Assuming I have all relevant indexes covered: How can I optimize the query, for it to take 0.1 seconds?
I'm pretty sure this is quite a good challenge for advanced SQL Administrators and programmers.
Upvotes: 2
Views: 150
Reputation: 753585
Given your query (somewhat reformatted):
SELECT V.id
FROM videos V
LEFT JOIN videos_categories VC ON V.id = VC.video_id
LEFT JOIN categories C ON VC.category_id = C.id
LEFT JOIN users U ON V.user_id = U.id
WHERE V.status = 1
AND V.reported < 10
AND (C.status = 1 OR C.id IS NULL)
AND (U.status = 1 OR U.id IS NULL)
GROUP BY V.id
ORDER BY V.id DESC
LIMIT 0, 12
You have incorrectly characterized your tables. You said:
The cardinalities (row counts) of Categories and Users would be informative. More seriously, though, the query references:
These fields should be being mentioned separately from the irrelevant fields, and any indexes on these columns should be identified. It would be better to provide table schemas which could be used to answer the query, with a comment '-- and other irrelevant columns
' at the end of each table.
Does the Video_Categories table have a unique constraint on the combined (Video_ID, Category_ID) columns? Why not?
It isn't immediately clear why the Videos table has a User_ID column; it looks more like there should be a Video_Users table with (Video_ID, User_ID) columns. However, that's a separate discussion. Also, it is not clear why you would have videos without a user ID value, so the left outer join to users is puzzling too. However, you bravely assert that this is not part of the problem, so we take you at your word.
LEFT OUTER JOIN can be a serious performance inhibitor. You might get better results from a UNION (or you might not - UNION can also be a performance inhibitor!):
SELECT V.ID
FROM (SELECT V.ID, V.User_ID, V.Status, V.Reported
FROM videos AS V
JOIN videos_categories AS VC ON V.id = VC.Video_ID
JOIN categories AS C ON VC.category_id = C.ID
WHERE C.Status = 1
UNION
SELECT V.ID, V.User_ID, V.Status, V.Reported
FROM videos V
WHERE V.ID NOT IN (SELECT Video_ID FROM Video_Categories)
) AS L
LEFT JOIN Users AS U ON L.User_ID = U.ID
WHERE L.Reported < 10
AND (U.status = 1 OR U.ID IS NULL)
GROUP BY L.ID
ORDER BY L.ID DESC
LIMIT 0, 12
(The alias 'L' is for 'List of Videos'.) The thinking here is that the first half of the UNION deals with inner joins, and the second half deals with the videos that aren't categorized. However, the NOT IN condition is likely to be a performance problem, if there is one. Come to think of it, I think the two lists of videos in the UNION should be disjoint, so you can use UNION ALL in place of UNION; that can be beneficial to performance (because it avoids a duplicate elimination phase).
You could probably usefully push the 'L.Reported < 10
' condition down into each half of the UNION (where it becomes V.Reported < 10
) if the optimizer does not do that automatically for you.
I'm by no means confident that this will perform better than the original, but it at least gives you some ideas to mull over.
Upvotes: 2
Reputation: 11
Jonathan raises some interesting and worthwhile points. In addition, if this is approached as an optimizer or index problem rather than a query problem, it might be worthwhile to ask how selectivity looks on the V.status column. (See here for more info on selectivity if needed.) If selectivity is poor, then:
It would likely be more efficient to do the join of V and VC tables and then filter out the rows that don't match the restriction on status
Indexing on V.status is probably not useful.
Some other things that might be useful to check are:
Update statistics on the V table (ANALYZE TABLE) in case poor stats are misleading the optimizer about selectivity on the status index
If this is on <5.5, check whether EXPLAIN shows the same plan on 5.5. Significant improvements were made to the optimizer on later versions.
Is Videos an InnoDB table? If so, the index on status is really on (PK, status) since the clustered index (the PK if there is one) is included in the nonclustered index on status. If it is MyISAM, you could test converting the table to see if that affects the plan.
As somewhat of an aside, I'd like point out as politely as possible that I think you might have a minor misconception about about what 'using temporary' and filesort mean. Baron Schwartz talks about that in a post here.
Upvotes: 1