Tim Haines

Reputation: 1526

Optimizing a simple query on two large tables

I'm trying to offer a feature where I can show the pages most viewed by a user's friends. My friendships table has 5.7M rows and the views table has 5.3M rows. At the moment I just want to run a query on these two tables and find the 20 most viewed page IDs for a person's friends.

Here's the query as I have it now:

SELECT page_id 
FROM `views` INNER JOIN `friendships` ON friendships.receiver_id = views.user_id 
WHERE (`friendships`.`creator_id` = 143416) 
GROUP BY page_id 
ORDER BY count(views.user_id) desc 
LIMIT 20

And here's how an explain looks:

+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-----------------------------------------+------+----------------------------------------------+
| id | select_type | table       | type | possible_keys                           | key                             | key_len | ref                                     | rows | Extra                                        |
+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-----------------------------------------+------+----------------------------------------------+
|  1 | SIMPLE      | friendships | ref  | PRIMARY,index_friendships_on_creator_id | index_friendships_on_creator_id | 4       | const                                   |  271 | Using index; Using temporary; Using filesort | 
|  1 | SIMPLE      | views       | ref  | PRIMARY                                 | PRIMARY                         | 4       | friendships.receiver_id                 |   11 | Using index                                  | 
+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-----------------------------------------+------+----------------------------------------------+

The views table has a primary key of (user_id, page_id), and you can see this is being used. The friendships table has a primary key of (receiver_id, creator_id), and a secondary index of (creator_id).

If I run this query without the GROUP BY and LIMIT, there are about 25,000 rows for this particular user, which is typical.

On the most recent real run, this query took 7 seconds to execute, which is way too long for a decent response in a web app.

One thing I'm wondering is whether I should adjust the secondary index to be (creator_id, receiver_id). I'm not sure that will give much of a performance gain though. I'll likely try it today, depending on answers to this question.
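For reference, the index change I'm considering would look something like this (the existing index name is taken from the EXPLAIN below; the new index name is just a placeholder):

-- Sketch only: swap the secondary index so creator_id leads and receiver_id
-- is carried in the index, letting MySQL resolve the WHERE and the join
-- column from the index alone.
ALTER TABLE friendships
  DROP INDEX index_friendships_on_creator_id,
  ADD INDEX index_friendships_on_creator_and_receiver (creator_id, receiver_id);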

Can you see any way the query can be rewritten to make it lightning fast?

Update: I need to do more testing on it, but it appears my nasty query works out better if I don't do the grouping and sorting in the DB, but do it in Ruby afterwards. The overall time is much shorter - by about 80%, it seems. Perhaps my early testing was flawed, but this definitely warrants more investigation. If it's true, then wtf is MySQL doing?
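If that holds up, the SQL itself gets cut down to just the join, with the counting and sorting moved into Ruby:

-- Sketch: fetch the raw page_id rows for this user's friends and
-- aggregate/sort them in application code instead of in MySQL.
SELECT page_id 
FROM `views` INNER JOIN `friendships` ON friendships.receiver_id = views.user_id 
WHERE (`friendships`.`creator_id` = 143416)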

Upvotes: 4

Views: 1571

Answers (3)

brian-brazil

Reputation: 34142

Your indexes look correct, although if the friendships table has very wide rows, you might want the index on (creator_id, receiver_id) so the lookup doesn't have to read the full rows.

However, something's not right here: why is there a filesort for 271 rows? Make sure your MySQL has at least a few megabytes for tmp_table_size and max_heap_table_size. That should make the GROUP BY faster.

sort_buffer_size should also have a sane value.
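For example, something like this at the session level just to test (the numbers are only a guess; tune them to your workload):

-- Rough sketch: give the in-memory temp table and the sort buffer a few MB
-- for this session, then re-run the query and compare.
SET SESSION tmp_table_size = 64 * 1024 * 1024;
SET SESSION max_heap_table_size = 64 * 1024 * 1024;
SET SESSION sort_buffer_size = 4 * 1024 * 1024;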

Upvotes: 0

nathan

Reputation: 4732

As far as I know, the best way to make a query like that "lightning fast" is to create a summary table that tracks friend page views per page, per creator.

You would probably want to keep it up to date with triggers, along the lines of the sketch below. Then your aggregation is already done for you, and getting the most viewed pages becomes a simple query. You can make sure you have proper indexes on the summary table, so that the database doesn't even have to sort to get the most viewed pages.
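A rough sketch of what that could look like (table, column, and trigger names are just examples, not anything from your schema):

-- Sketch only: a per-creator, per-page counter table.
CREATE TABLE friend_page_views (
  creator_id INT NOT NULL,
  page_id    INT NOT NULL,
  view_count INT NOT NULL DEFAULT 0,
  PRIMARY KEY (creator_id, page_id),
  KEY index_on_creator_and_count (creator_id, view_count)
);

-- The trigger bumps the counter for every friend (creator) of the user
-- who just viewed a page.
CREATE TRIGGER views_after_insert AFTER INSERT ON views
FOR EACH ROW
  INSERT INTO friend_page_views (creator_id, page_id, view_count)
    SELECT f.creator_id, NEW.page_id, 1
    FROM friendships f
    WHERE f.receiver_id = NEW.user_id
  ON DUPLICATE KEY UPDATE view_count = view_count + 1;

-- The read side then becomes a single indexed lookup:
SELECT page_id
FROM friend_page_views
WHERE creator_id = 143416
ORDER BY view_count DESC
LIMIT 20;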

Summary tables are the key to maintaining good performance for aggregation-type queries in read-mostly environments. You do the work up front, when the updates occur (infrequent), and then the queries (frequent) don't have to do any work.

If your stats don't have to be perfect, and your writes are actually fairly frequent (which is probably the case for something like page views), you can batch up views in memory and process them in the background, so that viewers don't take the hit of keeping the summary table up to date as they view pages. That solution also reduces contention on the database (fewer processes updating the summary table).

Upvotes: 1

Evert

Reputation: 99599

You should absolutely look into denormalizing this data. If you create a separate table that maintains the user IDs and the exact counts for every page they viewed, your query should become a lot simpler.

You can easily maintain this table with a trigger on your views table that updates the 'views_summary' table whenever an insert happens on the 'views' table.
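With that in place, the read query could look something like this (views_summary and its columns are assumed names, sketched from the description above):

-- Sketch: aggregate pre-counted views per friend rather than raw view rows;
-- assumes views_summary(user_id, page_id, view_count).
SELECT vs.page_id
FROM views_summary vs
INNER JOIN friendships f ON f.receiver_id = vs.user_id
WHERE f.creator_id = 143416
GROUP BY vs.page_id
ORDER BY SUM(vs.view_count) DESC
LIMIT 20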

You might even be able to denormalize this further by looking at the actual relationships, or just maintain the top x pages per person.

Hope this helps,

Evert

Upvotes: 0
