Hugo Pakula
Hugo Pakula

Reputation: 385

MySQL Query full table scan when keys available

In attempting to pull a large series of columns (~15-20) from several joined tables, I put together 2 views that would pull the necessary information. In my local DB (only ~1k posts rows), joining these views worked fine, however; when I created those same views on our production DB (~30k posts rows) and attempted to join the view, I realized that that solution wouldn't scale beyond a test dataset.

I attempted to migrate those 2 views (categories data—like categories.title—and creators' data—like users.display_name) into a CTE post_data which, in theory, would act as a keyed version of those views, and allow me to get all post data for the eligible posts.

I have put together a sample DBFiddle with some test data to explain the table structure. The actual data has many more columns, but this is representative of the joins necessary to build the query.

table : posts
+-----+-----------+------------+------------------------------------------+----------------------------------------+
| id  | parent_id | created_by |                 message                  |              attachments               |
+-----+-----------+------------+------------------------------------------+----------------------------------------+
|  8  | NULL      |          8 | laptop for sale                          | [{"media_id": 1380}]                   |
|  9  | NULL      |          4 | NEW lamp shade up for grabs              | [{"media_id": 1442}, {"link_id": 103}] |
|  10 | 1         |          7 | Oooh I could be interested               |                                        |
|  11 | 1         |          7 | DMing you now! I've been looking for one |                                        |
+-----+-----------+------------+------------------------------------------+----------------------------------------+

table : users
+----+------------------+---------------------------+
| id |   display_name   |        created_at         |
+----+------------------+---------------------------+
|  1 | John Appleseed   | 2018-02-20T00:00:00+00:00 |
|  2 | Massimo Jenkins  | 2018-05-14T00:00:00+00:00 |
|  3 | Johanna Marionna | 2018-06-05T00:00:00+00:00 |
|  4 | Jackson Creek    | 2018-11-15T00:00:00+00:00 |
|  5 | Joe Schmoe       | 2019-01-09T00:00:00+00:00 |
|  6 | John Johnson     | 2019-02-14T00:00:00+00:00 |
|  7 | Donna Madison    | 2019-05-14T00:00:00+00:00 |
|  8 | Jenna Kaplan     | 2019-06-23T00:00:00+00:00 |
+----+------------------+---------------------------+

table : categories
+----+------------+------------+-------------------------------------------------------+
| id | created_by |   title    |                      description                      |
+----+------------+------------+-------------------------------------------------------+
|  1 |          2 | Technology | Anything tech; Consumer, business or education tools! |
|  2 |          2 | Home Goods | Anything for the home                                 |
+----+------------+------------+-------------------------------------------------------+

table : categories_posts
+---------+-------------+
| post_id | category_id |
+---------+-------------+
|       8 |           1 |
|       9 |           1 |
|      10 |           1 |
|      11 |           1 |
+---------+-------------+

table : users_categories
+---------+-------------+
| user_id | category_id |
+---------+-------------+
|       1 |           1 |
|       2 |           1 |
|       3 |           1 |
|       4 |           1 |
+---------+-------------+

table : posts_removed
+---------+----------------------+------------+
| post_id |      removed_at      | removed_by |
+---------+----------------------+------------+
|      10 |  2019-01-22 09:08:14 |          7 |
+---------+----------------------+------------+

In the below query, eligible posts are determined in the base SELECT; then, the post_data CTE is joined to the result set (limited to 25 rows) and all columns from the CTE are returned.

WITH post_data AS (
    SELECT posts.id,
           posts.parent_id,
           posts.created_by,
           posts.attachments,
           categories_posts.category_id,
           categories.title,
           categories.created_by AS category_created_by,
           creator.display_name AS creator_display_name,
           creator.created_at AS creator_created_at
           /* ... And a whole bunch of other fields from posts, categories_posts, users */
    FROM posts
    LEFT OUTER JOIN categories_posts
        ON categories_posts.post_id = posts.id
    LEFT OUTER JOIN categories
        ON categories.id = categories_posts.category_id
    LEFT OUTER JOIN users creator
        ON creator.id = posts.created_by
    /* ... And a whole bunch of other joins to facilitate the selected fields */
)
SELECT post_data.*
FROM posts
        /* Set up the criteria for the posts selected before getting their data from the CTE */
    LEFT OUTER JOIN posts_removed removed ON removed.post_id = posts.id
    LEFT OUTER JOIN users user_me ON user_me.id = "1"
    LEFT OUTER JOIN users_followed ON users_followed.user_id = posts.created_by
        AND users_followed.followed_by = user_me.id
    LEFT OUTER JOIN categories_posts ON categories_posts.post_id = posts.id
    LEFT OUTER JOIN users_categories ON users_categories.category_id = categories_posts.category_id
    LEFT OUTER JOIN posts_removed pp_removed ON pp_removed.post_id = posts.parent_id
    /* Join our post_data on the post's ID */
    JOIN post_data ON post_data.id = posts.id
WHERE
(
    (
        users_categories.user_id = user_me.id AND users_categories.left_at IS NULL
    ) OR categories_posts.category_id IS NULL
) AND (
    posts.created_by = user_me.id
    OR users_followed.followed_by = user_me.id
    OR categories_posts.category_id IS NOT NULL
) AND removed.removed_at IS NULL
    AND pp_removed.removed_at IS NULL
    AND (post_data.id = posts.id OR post_data.id = posts.parent_id)
ORDER BY posts.id DESC
LIMIT 25

In theory, I thought this would work by selecting the rows based on the base select criteria, then doing an index scan for the CTE based on the Post ID; however, it seems that the query optimizer chooses instead to do a full table scan of the posts table.

The EXPLAIN SELECT gave me this information:

+----+-------------+------------------------+--------+-------------------------------+-------------+---------+---------------------------------------------+--------+----------+----------------------------------------------------+
| id | select_type |         table          |  type  |         possible_keys         |     key     | key_len |                     ref                     |  rows  | filtered |                       extra                        |
+----+-------------+------------------------+--------+-------------------------------+-------------+---------+---------------------------------------------+--------+----------+----------------------------------------------------+
|  1 | PRIMARY     | posts                  | ALL    | PRIMARY,parent_id,created_by  |             |         |                                             |  33870 |      100 | Using temporary; Using filesort                    |
|  1 | PRIMARY     | removed                | eq_ref | PRIMARY                       | PRIMARY     |       8 | posts.id                                    |      1 |       19 | Using where                                        |
|  1 | PRIMARY     | user_me                | const  | PRIMARY                       | PRIMARY     |       8 | const                                       |      1 |      100 | Using where; Using index                           |
|  1 | PRIMARY     | categories_posts       | eq_ref | PRIMARY                       | PRIMARY     |       8 | api.posts.id                                |      1 |      100 |                                                    |
|  1 | PRIMARY     | categories             | eq_ref | PRIMARY                       | PRIMARY     |       8 | categories_posts.category_id                |      1 |      100 | Using index                                        |
|  1 | PRIMARY     | users_categories       | eq_ref | user_id_2,user_id,category_id | user_id_2   |      16 | user_me.id,api.categories_posts.category_id |      1 |      100 | Using where                                        |
|  1 | PRIMARY     | users_followed         | eq_ref | user_id,followed_by           | user_id     |      16 | posts.created_by,api.user_me.id             |      1 |      100 | Using where; Using index                           |
|  1 | PRIMARY     | pp_removed             | eq_ref | PRIMARY                       | PRIMARY     |       8 | api.posts.parent_id                         |      1 |       19 | Using where                                        |
|  1 | PRIMARY     | <derived2>             | ALL    |                               |             |         |                                             | 493911 |       19 | Using where; Using join buffer (Block Nested Loop) |
|  2 | DERIVED     | posts                  | ALL    |                               |             |         |                                             |  33870 |      100 | Using temporary                                    |
|  2 | DERIVED     | categories_posts       | eq_ref | PRIMARY                       | PRIMARY     |       8 | api.posts.id                                |      1 |      100 |                                                    |
|  2 | DERIVED     | categories             | eq_ref | PRIMARY                       | PRIMARY     |       8 | api.categories_posts.category_id            |      1 |      100 |                                                    |
|  2 | DERIVED     | posts_votes            | ref    | post_id                       | post_id     |       8 | api.posts.id                                |      1 |      100 | Using index                                        |
|  2 | DERIVED     | pp                     | eq_ref | PRIMARY                       | PRIMARY     |       8 | api.posts.parent_id                         |      1 |      100 |                                                    |
|  2 | DERIVED     | pp_removed             | eq_ref | PRIMARY                       | PRIMARY     |       8 | api.pp.id                                   |      1 |      100 | Using index                                        |
|  2 | DERIVED     | removed                | eq_ref | PRIMARY                       | PRIMARY     |       8 | api.posts.id                                |      1 |      100 | Using index                                        |
|  2 | DERIVED     | creator                | eq_ref | PRIMARY                       | PRIMARY     |       8 | api.posts.created_by                        |      1 |      100 |                                                    |
|  2 | DERIVED     | usernames              | ref    | user_id                       | user_id     |       8 | api.creator.id                              |      1 |      100 |                                                    |
|  2 | DERIVED     | verifications          | ALL    |                               |             |         |                                             |      4 |      100 | Using where; Using join buffer (Block Nested Loop) |
|  2 | DERIVED     | categories_identifiers | ref    | category_id                   | category_id |       8 | api.categories.id                           |      1 |      100 |                                                    |
+----+-------------+------------------------+--------+-------------------------------+-------------+---------+---------------------------------------------+--------+----------+----------------------------------------------------+

Beyond this, I tried refactoring my query to try and force key usage in the posts table, such as using FORCE INDEX(PRIMARY) in the select, and moving the CTE be the base query and adding a filter WHERE id IN ({the original base query}), but it seems the optimizer still does a full table scan.

In case it's helpful to decode what's happening in the query plan:

My core questions are:

  1. Am I correct when I say that subqueries should only be executed once per result row from the base select query? If so, then the CTE should also use the JOIN on posts.id and likely use the table index?

  2. Why does the query plan show that it selects 33,870 rows when there are only 33,387? And where do the 493,911 rows come from?

  3. How do you prevent a full table scan in this case?

Upvotes: 0

Views: 774

Answers (1)

Rick James
Rick James

Reputation: 142518

Give this a try... Do the LIMIT 25 before JOINing to the WITH:

SELECT * FROM
    ( SELECT ... FROM posts
               JOIN categories_posts ...
        ORDER BY posts.id DESC
        LIMIT 25 ) AS x
    JOIN post_data
       ON post_data.id IN (x.id, x.parent_id)
    ORDER BY posts.id DESC

Upvotes: 0

Related Questions