GJ.
GJ.

Reputation: 870

MySQL query - derived table vs left join where

Given posts and posts_tags:

CREATE TABLE posts (
    id SERIAL PRIMARY KEY,
    external_id varchar(10) UNIQUE,
    title varchar(255),
    body TEXT
)

CREATE TABLE posts_tags (
    id SERIAL PRIMARY KEY,
    post_external_id varchar(10) REFERENCES posts(external_id),
    tag varchar(255),
    UNIQUE KEY (post_external_id, tag)
)

As you can see the above tables store posts and their tags for a backend serving REST APIs.

There is a get-post-by-external-id API that will return the post with its tags identified by the post's external_id. Given this use case, I'm wondering which query below executes faster and consumes less resources?

Derived table

SELECT posts.*, posts_tags.tag 
FROM (SELECT * FROM posts WHERE external_id = ?) AS posts 
LEFT JOIN posts_tags ON posts.external_id = posts_tags.post_external_id

Left join where

SELECT posts.*, posts_tags.tag 
FROM posts 
LEFT JOIN posts_tags ON posts.external_id = posts_tags.post_external_id
WHERE posts.external_id = ?

I'm not sure, but I guess the WHERE filters the intermediate results after the LETF JOIN happens?

Upvotes: 1

Views: 274

Answers (1)

Paul Spiegel
Paul Spiegel

Reputation: 31812

There is no need for a subquery. The engine doesn't evaluate/execute the query in the same order as you write. The optimizer will generate an "execution path" as optimal as possible (though it doesn't work perfect). However for this particular case it's documented in WHERE Clause Optimization:

For each table in a join, a simpler WHERE is constructed to get a fast WHERE evaluation for the table and also to skip rows as soon as possible.

So your guess:

I guess the WHERE filters the intermediate results after the LETF JOIN happens?

is wrong. Given an index on posts.external_id the engine will only "touch" the matching row(s) and perform the join with posts_tags only for that/those row(s).

And from my experience I can also tell you, that the second (trivial) query is perfectly fine performance wise.

I would though not use that query for another reason. You get the post data duplicated per related tag. Consider you have a post containing a 10KB text and having 10 tags. You will get 10 rows (one per tag) with the same text and fetch together 100KB of data. Then you will need to write code to "deduplicate" the data.

I would rather either run two separate queries:

SELECT posts.*
FROM posts 
WHERE posts.external_id = ?

and

SELECT posts_tags.tag 
FROM posts_tags
WHERE posts_tags.post_external_id = ?

and "merge" the result in application code

or aggregate the tags with GROUP_CONCAT()

SELECT posts.*, GROUP_CONCAT(posts_tags.tag) as tags
FROM posts 
LEFT JOIN posts_tags ON posts.external_id = posts_tags.post_external_id
WHERE posts.external_id = ?
GROUP BY posts.external_id

or with JSON_ARRAYAGG()

SELECT posts.*, JSON_ARRAYAGG(posts_tags.tag) as tags
FROM posts 
LEFT JOIN posts_tags ON posts.external_id = posts_tags.post_external_id
WHERE posts.external_id = ?
GROUP BY posts.external_id

Parsing a comma delimited list or a JSON array should be simple in any application language.

Upvotes: 1

Related Questions