Mosty Mostacho
Mosty Mostacho

Reputation: 43494

How to properly GROUP BY in MySQL?

I have the following (intentionally denormalized for demonstrating purposes) sample CARS table:

| CAR_ID | OWNER_ID | OWNER_NAME | COLOR |
|--------|----------|------------|-------|
|      1 |        1 |       John | White |
|      2 |        1 |       John | Black |
|      3 |        2 |       Mike | White |
|      4 |        2 |       Mike | Black |
|      5 |        2 |       Mike | Brown |
|      6 |        3 |       Tony | White |

If I wanted to count the amount of cars per owner and return this:

| OWNER_ID | OWNER_NAME | TOTAL |
|----------|------------|-------|
|        1 |       John |     2 |
|        2 |       Mike |     3 |
|        3 |       Tony |     1 |

I know I can write the following query:

SELECT owner_id, owner_name, COUNT(*) total FROM cars
GROUP BY owner_id, owner_name

However, removing owner_name from the GROUP BY clause gives me the same results.

  1. What is the difference between those 2 queries?
  2. Under what circumstances should I group by all non-agreggated fields in the SELECT statement and in which ones shouldn't I?
  3. Can you give an example in which this grouping would return different results when removing a non-aggregated field and explain why?

Upvotes: 1

Views: 281

Answers (1)

Mosty Mostacho
Mosty Mostacho

Reputation: 43494

The first thing to make clear is that SQL is not MySQL.

In standard SQL it is not allowed to group by a subset of the non-aggregated fields. The reason is very simple. Suppose I'm running this query:

SELECT color, owner_name, COUNT(*) FROM cars
GROUP BY color

That query would not make any sense. Even trying to explain it would be impossible. For sure it is selecting colors and counting the amount of cars per color. However, it is also adding the owner_name field and there can be many owners for a given color, as it is the case of the White color. So if there can be many owner_name values for a single color which happens to be the only field in the GROUP BY clause... then which owner_name will be returned?

If it is needed to return an owner_name then some kind of criteria should be added to only select one of them, e.g., the first one alphabetically, which in this case would be John. That criteria would result in adding an aggregate function MIN(owner_name) and then the query will make sense again as it will be grouping by, at least, all the non-agreggated fields in the select statement.

As you can see, there is a clear and practical reason for standard SQL to be inflexible in the grouping. If it wasn't, you could face awkward situations in which the value for a column will be unpredictable, and that is not a nice word, particularly if the query being run is showing you your bank account transactions.

Having said that, then why would MySQL allow queries that might not make sense? And even worse, the error in the query above could be just syntactically detected! The short answer is: performance. The long answer is that there are certain situations in which, based on data relations, getting an unpredictable value from the group will result in a predictable value.

If you haven't figured it out yet, the only way in which you can predict the value you'll get from taking an unpredictable element from a group will be if all the elements in the group are the same. A clear example of this situation is in the sample query in your very same question. Look at how owner_id and owner_name relates in the table. It is clear that given any owner_id, e.g. 2, you can only have one distinct owner_name. Even having many rows, by choosing any, you will get Mike as the result. In formal database jargon this can be explained as owner_id functionally determines owner_name.

Let's take a closer look at that fully working MySQL query:

SELECT owner_id, owner_name, COUNT(*) total FROM cars
GROUP BY owner_id

Given any owner_id this would return the same owner_name, so adding it to the GROUP BY clause will not result in more rows returned. Even adding an aggregated function MAX(owner_name) will not result in less rows returned. The resulting data will be exacly the same. In both cases, the query would be immediately turned into a legal standard SQL query as at least all the non-aggregated fields would be grouped by. So there are 3 approaches to get the same results.

However, as I mentioned before, this non-standard grouping has a performance advantage. You can check this so underrated link in which this is explained for more detail but I'm going to cite the most important part:

You can use this feature to get better performance by avoiding unnecessary column sorting and grouping. [...] The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate.

One thing that is worth mentioning is that the results are not necessarily wrong but rather indeterminate. In other words, getting the expected results does not mean you have written the right query. Writing the right query will always give you the expected results.

As you can see, it might be worth applying this MySQL extension to the GROUP BY clause. Anyway, if this is not 100% clear yet then there is a rule of thumb that will make sure that your grouping will always be correct: Always group, at least, by all the non-aggregated fields in the select clause. You might be wasting a few CPU cycles in certain situations but it is better than returning indeterminate results. If you're still terrified about not grouping correctly then changing the ONLY_FULL_GROUP_BY SQL mode could be a last resort :)

May your grouping be correct and performant... or at least correct.

Upvotes: 4

Related Questions