Reputation: 883
I have three tables: categories, articles, and article_events, with the following structure
categories: id, name (100,000 rows)
articles: id, category_id (6000 rows)
article_events: id, article_id, status_id (20,000 rows)
The highest article_events.id for each article row describes the current status of each article.
I'm returning a table of categories and how many articles are in them with a most-recent-event status_id of '1'.
What I have so far works, but is fairly slow (10 seconds) with the size of my tables. Wondering if there's a way to make this faster. All the tables have proper indexes as far as I know.
SELECT c.id,
c.name,
SUM(CASE WHEN e.status_id = 1 THEN 1 ELSE 0 END) article_count
FROM categories c
LEFT JOIN articles a ON a.category_id = c.id
LEFT JOIN (
SELECT article_id, MAX(id) event_id
FROM article_events
GROUP BY article_id
) most_recent ON most_recent.article_id = a.id
LEFT JOIN article_events e ON most_recent.event_id = e.id
GROUP BY c.id
Basically I have to join to the events table twice, since asking for the status_id along with the MAX(id) just returns the first status_id it finds, and not the one associated with the MAX(id) row.
Any way to make this better? or do I just have to live with 10 seconds? Thanks!
Edit:
Here's my EXPLAIN for the query:
ID | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
---------------------------------------------------------------------------------------------------------------------------
1 | PRIMARY | c | index | NULL | PRIMARY | 4 | NULL | 124044 | Using index; Using temporary; Using filesort
1 | PRIMARY | a | ref | category_id | category_id | 4 | c.id | 3 |
1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 6351 |
1 | PRIMARY | e | eq_ref | PRIMARY | PRIMARY | 4 | most_recent.event_id | 1 |
2 | DERIVED | article_events | ALL | NULL | NULL | NULL | NULL | 19743 | Using temporary; Using filesort
Upvotes: 4
Views: 128
Reputation: 53830
If you can eliminate subqueries with JOINs, it often performs better because derived tables can't use indexes. Here's your query without subqueries:
SELECT c.id,
c.name,
COUNT(a1.article_id) AS article_count
FROM categories c
LEFT JOIN articles a ON a.category_id = c.id
LEFT JOIN article_events ae1
ON ae1.article_id = a.id
LEFT JOIN article_events ae2
ON ae2.article_id = a.id
AND ae2.id > a1.id
WHERE ae2.id IS NULL
GROUP BY c.id
You'll want to experiment with the indexes and use EXPLAIN to test, but here's my guess (I'm assuming id
fields are primary keys and you are using InnoDB):
categories: `name`
articles: `category_id`
article_events: (`article_id`, `id`)
Upvotes: 1
Reputation: 22865
I don't like that index on categories.id
is used, as you're selecting the whole table.
Try running:
ANALYZE TABLE categories;
ANALYZE TABLE article_events;
and re-run the query.
Upvotes: 0
Reputation: 1128
Didn't try it, but I'm thinking this will save a bit of work for the database:
SELECT ae.article_id AS ref_article_id,
MAX(ae.id) event_id,
ae.status_id,
(select a.category_id from articles a where a.id = ref_article_id) AS cat_id,
(select c.name from categories c where c.id = cat_id) AS cat_name
FROM article_events
GROUP BY ae.article_id
Hope that helps
EDIT:
By the way... Keep in mind that joins have to go through each row, so you should start your selection from the small end and work your way up, if you can help it. In this case, the query has to run through 100,000 records, and join each one, then join those 100,000 again, and again, and again, even if values are null, it still has to go through those.
Hope this all helps...
Upvotes: 0