Reputation: 16181

Optimize MySQL aggregate query for a single table with 28 million rows

I need help optimizing a query. There's a pivot table holding user IDs matched with notification IDs per user:

+----+---------+-----------------+
| id | user_id | notification_id |
+----+---------+-----------------+
|  1 |     234 |               3 |
|  2 |     546 |              34 |
|  3 |     646 |              11 |
+----+---------+-----------------+

Both user_id and notification_id are foreign keys. The table has ~28 million rows.

The idea is to get the IDs of the 100 users with the most notifications, counting only users who have at least 120, ordered from most notifications to fewest:

SELECT user_id, COUNT(feed_notification_id) AS notification_count 
FROM sd_user_feed_notification
GROUP BY user_id
HAVING notification_count >= 120
ORDER BY notification_count DESC
LIMIT 100

The problem is that the query above runs for over 200 seconds, since it basically has to go through every row to aggregate the notifications.

The foreign keys are already indexed. The query itself is pretty straightforward.

Is there any way to optimize it?

MySQL version: 5.6

Upvotes: 0

Answers (2)

Rick James

Reputation: 142356

sd_user_feed_notification sounds like a mapping table between user and feed_notification. If so, get rid of the FKs and follow the rules given here for many-to-many tables. That will include

PRIMARY KEY(user_id, notification_id),  -- implies UNIQUE
INDEX(notification_id, user_id)         -- saying UNIQUE would be redundant

(or vice versa). At which point all of the above comments are handled. Furthermore the table has only 2 columns, so it is as small as an index -- either is as fast as it can be.

In virtually all cases, get rid of INDEX(a) when you add INDEX(a,b). But do not drop INDEX(b). The order of the columns in a composite index is important.

Upvotes: 0

spencer7593

Reputation: 108450

If there is no composite index on (user_id, feed_notification_id), then likely the query isn't being satisfied entirely from an index. That is, the execution plan is performing lookups to the underlying table pages, to check if feed_notification_id is NULL. (A COUNT(expr) aggregate will not include rows where the expression evaluates to NULL.)
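To illustrate the NULL behavior, here is a minimal sketch using a throwaway table (count_demo is a made-up name for illustration, not part of the schema in the question):

```sql
-- Hypothetical scratch table, only to show how COUNT treats NULL.
CREATE TEMPORARY TABLE count_demo
( user_id              INT NOT NULL
, feed_notification_id INT NULL
);

INSERT INTO count_demo (user_id, feed_notification_id)
VALUES (1, 10), (1, NULL), (1, 12);

-- COUNT(*) counts rows; COUNT(expr) skips rows where expr is NULL.
SELECT COUNT(*)                    AS all_rows       -- 3
     , COUNT(feed_notification_id) AS non_null_rows  -- 2
  FROM count_demo;
```

This is why a nullable feed_notification_id forces MySQL to look at the column's value for every row, rather than just counting index entries.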

We would (likely) get better performance with a query that can be satisfied from an index, for example, by removing the reference to the feed_notification_id column.

If we are guaranteed that feed_notification_id is NOT NULL, then this would get us an equivalent result:

EXPLAIN 
SELECT user_id
     , COUNT(1) AS notification_count 
  FROM sd_user_feed_notification
 GROUP BY user_id

(We expect the EXPLAIN output to show "Using index" in the Extra column.)

So the query would be a full scan of just an index, with no lookups to the underlying table.

That's still going to need to evaluate 28 million rows. And with the ORDER BY on the aggregate expression, there's no getting around the "Using filesort" operation.
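For reference, the complete query from the question rewritten the same way might look like the following. This is only a sketch, and it assumes feed_notification_id is declared NOT NULL, so that COUNT(*) returns the same counts as COUNT(feed_notification_id):

```sql
-- Assumes feed_notification_id is NOT NULL; otherwise the counts can differ.
SELECT user_id
     , COUNT(*) AS notification_count
  FROM sd_user_feed_notification
 GROUP BY user_id
HAVING notification_count >= 120
 ORDER BY notification_count DESC
 LIMIT 100;
```

Prefixing this with EXPLAIN should show whether "Using index" appears in the Extra column. The sort is on the aggregate, so it is applied to one row per user that survives the HAVING filter, not to all 28 million rows.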

If we have to stick with the existing query, then optimal performance (of that query) would be with a composite index ON sd_user_feed_notification (user_id, feed_notification_id).

And adding that index would render an index ON sd_user_feed_notification (user_id) redundant.
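A sketch of that index change follows. The index names here are hypothetical (the real name of the existing single-column index is not given in the question), so check the output of SHOW INDEX first:

```sql
-- List the existing indexes to find the actual name of the user_id index.
SHOW INDEX FROM sd_user_feed_notification;

-- Add the composite index first (user_notification_IX is a made-up name).
ALTER TABLE sd_user_feed_notification
  ADD INDEX user_notification_IX (user_id, feed_notification_id);

-- Then drop the now-redundant single-column index.
-- user_id_IX is a placeholder; substitute the name reported by SHOW INDEX.
ALTER TABLE sd_user_feed_notification
  DROP INDEX user_id_IX;
```

Adding the composite index before dropping the old one means there is always an index with user_id as the leading column available to the foreign key.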

FOLLOWUP

Q: (1) Should I then remove the single indexes on user_id and notification_id and stick to the compound one only in the case of my query?

Q: (2) Wouldn't this affect other queries run against the table?

A: If we add the composite index on (user_id,feed_notification_id), then we can remove the index on just (user_id). This composite index is suitable for supporting a foreign key constraint.

Any query that was benefiting from the old (singleton user_id column) index can benefit from the replacement (composite) index (with user_id as the leading column.)

And some queries will benefit more, eliminating lookups to pages in the underlying table (to retrieve values of notification_id.)

The replacement index will be larger, but it will work the same, in terms of improving performance by eliminating vast swaths of rows when we're looking for rows related to a single user.

The new composite index is not a replacement for the index on the feed_notification_id column.

We would still need an index that has that column as the leading column. (We could replace it with a composite index on (feed_notification_id,user_id).)

The order of columns in an index is significant.
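As a rough illustration of why the leading column matters, compare the two lookups below. The literal values are taken from the sample rows in the question, and the indexes named in the comments are the ones discussed above, assumed to exist:

```sql
-- Filters on user_id: can use an index that leads with user_id,
-- e.g. (user_id, feed_notification_id).
EXPLAIN
SELECT feed_notification_id
  FROM sd_user_feed_notification
 WHERE user_id = 234;

-- Filters on feed_notification_id: needs an index that leads with that column,
-- e.g. (feed_notification_id, user_id).
EXPLAIN
SELECT user_id
  FROM sd_user_feed_notification
 WHERE feed_notification_id = 34;
```

The key column of the EXPLAIN output shows which index the optimizer picked for each query.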

If the combination of (user_id,feed_notification_id) is UNIQUE, then we can define the index as a UNIQUE index, and enforce that.
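Before declaring the index UNIQUE, it is probably worth checking whether the existing data already contains duplicate pairs, since adding a unique index would fail if it does. A minimal sketch (the index name is hypothetical):

```sql
-- Look for (user_id, feed_notification_id) pairs that occur more than once.
SELECT user_id
     , feed_notification_id
     , COUNT(*) AS copies
  FROM sd_user_feed_notification
 GROUP BY user_id, feed_notification_id
HAVING copies > 1
 LIMIT 10;

-- Only if the query above returns no rows
-- (user_notification_UX is a made-up name):
ALTER TABLE sd_user_feed_notification
  ADD UNIQUE INDEX user_notification_UX (user_id, feed_notification_id);
```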

Also, if this table is purely a linkage/association/join table, and is not an entity table (i.e. there are no foreign key references to this table), then for performance, I would consider dropping the id column (presumably that's defined as the PRIMARY (cluster) key).

I would tend towards a table definition like this:

CREATE TABLE sd_user_feed_notification
( user_id               INT NOT NULL COMMENT 'PK, FK ref user.id'
, feed_notification_id  INT NOT NULL COMMENT 'PK, FK ref feed_notification.id'
, PRIMARY KEY (user_id, feed_notification_id)
, KEY sd_user_feed_notification_IX (feed_notification_id, user_id)

, CONSTRAINT FK_sd_user_feed_notification_user 
  FOREIGN KEY (user_id)              REFERENCES sd_user (id) 
  ON UPDATE CASCADE ON DELETE CASCADE 

, CONSTRAINT FK_sd_user_feed_notification_feed
  FOREIGN KEY (feed_notification_id) REFERENCES sd_feed_notification (id)
  ON UPDATE CASCADE ON DELETE CASCADE

) ENGINE=InnoDB
;

Upvotes: 1
