Reputation: 13110
More of a question of interest than something I'm looking to use
Say I run the following query:
SELECT su.id, su.name, sua.line_1, sua.line_2
FROM site_user su
JOIN site_user_address sua
ON sua.user_id = su.id
WHERE su.id = 1
GROUP BY su.id /* id is the PK for site_user */
Are sua.line_1
and sua.line_2
guaranteed to come back from the same site_user_address
row?
I'm aware that, at the very least, the site_user_address
row is arbitrarily selected
I couldn't find anything in the docs, and can't think of a reliable way to test this
UPDATE
This is not a duplicate as far as I'm aware.. I'm wondering if the arbitrary selection of a value will come from the same arbitrarily selected row when choosing two different columns from the same table
Other questions seem to be less specific, and more directed towards the arbitrary selection of both row and value of one or more non-aggregate columns
Upvotes: 3
Views: 65
Reputation: 108651
You are indeed exploiting the notorious non-standard MySQL implementation of GROUP BY
. Read this for more info https://dev.mysql.com/doc/refman/8.0/en/group-by-handling.html
Your query is equivalent to using ANY_VALUE()
for the values from the second table, like this.
SELECT su.id, su.name,
ANY_VALUE(sua.line_1) line_1,
ANY_VALUE(sua.line_2) line_2
FROM site_user su
JOIN site_user_address sua
ON sua.user_id = su.id
WHERE su.id = 1
GROUP BY su.id
As long as site_user.id is unique for each row, you'll get the right name because it's dependent on the id value.
But for the values from the other table, your results will be, formally speaking, non-deterministic or unpredictable. Do they come back from the same row as each other? Unpredictable.
Unpredictable is like random, but worse. Random implies you'll get different values sometimes, so you can catch problems in test. Unpredictable means you'll get the same values every time, until you don't. Usually changes in the chosen values occur when tables and indexes grow larger.
You may want to use a query that gives predictable results. For example, this one will give back the rows of site_user_address
that have the highest value of site_user_address.id
.
SELECT su.id, su.name,
sua.line_1, sua.line_2
FROM site_user su
JOIN (
SELECT MAX(id) id, user_id
FROM site_user_address
GROUP BY user_id
) sumax ON su.id = sumax.user_id
JOIN site_user_address sua ON sumax.id = sua.id
WHERE su.id = 1
The subquery:
SELECT MAX(id) id, user_id
FROM site_user_address
GROUP BY user_id
retrieves the largest site_user_address.id
value for each user_id
value. You can then join to the site_user_address table to pull the detail values for those id
values.
Beware the temptation to infer that behaviors are designed into a DBMS, but not documented, just because you observe them. The behavior in your question is documented. The documentation says it is non-deterministic. That means that the DBMS query planner is free to satisfy your query in whatever way seems most efficient. As tables and indexes grow, the query planner can, and does, choose different ways to satisfy the same query. Different query plans can also come from different parts of tables being resident in caches. Etc. Etc.
This unpredictability is hard for programmers to wrap our brains around. We wouldn't tolerate unpredictability in procedural Java or php code. But SQL is declarative, and many thousands of programmer years have gone into making queries run fast. Don't try to outsmart a DBMS, even if you're Michael Stonebraker himself.
Upvotes: 2