Are arbitrarily SELECTed non-aggregate columns guaranteed to come from the same row with a MySQL GROUP BY?

Question

More of a question of interest than something I'm looking to use

Say I run the following query:

  SELECT su.id, su.name, sua.line_1, sua.line_2
    FROM site_user su
    JOIN site_user_address sua
      ON sua.user_id = su.id
   WHERE su.id = 1 
GROUP BY su.id /* id is the PK for site_user */

Are sua.line_1 and sua.line_2 guaranteed to come back from the same site_user_address row?

I'm aware that, at the very least, the site_user_address row is arbitrarily selected

I couldn't find anything in the docs, and can't think of a reliable way to test this

UPDATE

This is not a duplicate as far as I'm aware.. I'm wondering if the arbitrary selection of a value will come from the same arbitrarily selected row when choosing two different columns from the same table

Other questions seem to be less specific, and more directed towards the arbitrary selection of both row and value of one or more non-aggregate columns

O. Jones · Accepted Answer

You are indeed exploiting the notorious non-standard MySQL implementation of GROUP BY. Read this for more info https://dev.mysql.com/doc/refman/8.0/en/group-by-handling.html

Your query is equivalent to using ANY_VALUE() for the values from the second table, like this.

  SELECT su.id, su.name, 
         ANY_VALUE(sua.line_1) line_1, 
         ANY_VALUE(sua.line_2) line_2
    FROM site_user su
    JOIN site_user_address sua
      ON sua.user_id = su.id
   WHERE su.id = 1 
   GROUP BY su.id

As long as site_user.id is unique for each row, you'll get the right name because it's dependent on the id value.

But for the values from the other table, your results will be, formally speaking, non-deterministic or unpredictable. Do they come back from the same row as each other? Unpredictable.

Unpredictable is like random, but worse. Random implies you'll get different values sometimes, so you can catch problems in test. Unpredictable means you'll get the same values every time, until you don't. Usually changes in the chosen values occur when tables and indexes grow larger.

You may want to use a query that gives predictable results. For example, this one will give back the rows of site_user_address that have the highest value of site_user_address.id.

  SELECT su.id, su.name, 
         sua.line_1, sua.line_2
    FROM site_user su
    JOIN (
              SELECT MAX(id) id, user_id
                FROM site_user_address
               GROUP BY user_id
         ) sumax               ON su.id = sumax.user_id
    JOIN site_user_address sua ON sumax.id = sua.id
   WHERE su.id = 1

The subquery:

              SELECT MAX(id) id, user_id
                FROM site_user_address
               GROUP BY user_id

retrieves the largest site_user_address.id value for each user_id value. You can then join to the site_user_address table to pull the detail values for those id values.

Beware the temptation to infer that behaviors are designed into a DBMS, but not documented, just because you observe them. The behavior in your question is documented. The documentation says it is non-deterministic. That means that the DBMS query planner is free to satisfy your query in whatever way seems most efficient. As tables and indexes grow, the query planner can, and does, choose different ways to satisfy the same query. Different query plans can also come from different parts of tables being resident in caches. Etc. Etc.

This unpredictability is hard for programmers to wrap our brains around. We wouldn't tolerate unpredictability in procedural Java or php code. But SQL is declarative, and many thousands of programmer years have gone into making queries run fast. Don't try to outsmart a DBMS, even if you're Michael Stonebraker himself.

Are arbitrarily SELECTed non-aggregate columns guaranteed to come from the same row with a MySQL GROUP BY?

Answers (1)

Related Questions