Reputation: 43494
I have the following (intentionally denormalized for demonstrating purposes) sample CARS table:
| CAR_ID | OWNER_ID | OWNER_NAME | COLOR |
|--------|----------|------------|-------|
| 1 | 1 | John | White |
| 2 | 1 | John | Black |
| 3 | 2 | Mike | White |
| 4 | 2 | Mike | Black |
| 5 | 2 | Mike | Brown |
| 6 | 3 | Tony | White |
If I wanted to count the amount of cars per owner and return this:
| OWNER_ID | OWNER_NAME | TOTAL |
|----------|------------|-------|
| 1 | John | 2 |
| 2 | Mike | 3 |
| 3 | Tony | 1 |
I know I can write the following query:
SELECT owner_id, owner_name, COUNT(*) total FROM cars
GROUP BY owner_id, owner_name
However, removing owner_name
from the GROUP BY
clause gives me the same results.
SELECT
statement and in which ones shouldn't I?Upvotes: 1
Views: 281
Reputation: 43494
The first thing to make clear is that SQL is not MySQL.
In standard SQL it is not allowed to group by a subset of the non-aggregated fields. The reason is very simple. Suppose I'm running this query:
SELECT color, owner_name, COUNT(*) FROM cars
GROUP BY color
That query would not make any sense. Even trying to explain it would be impossible. For sure it is selecting colors and counting the amount of cars per color. However, it is also adding the owner_name
field and there can be many owners for a given color, as it is the case of the White
color. So if there can be many owner_name
values for a single color
which happens to be the only field in the GROUP BY
clause... then which owner_name
will be returned?
If it is needed to return an owner_name
then some kind of criteria should be added to only select one of them, e.g., the first one alphabetically, which in this case would be John
. That criteria would result in adding an aggregate function MIN(owner_name)
and then the query will make sense again as it will be grouping by, at least, all the non-agreggated fields in the select statement.
As you can see, there is a clear and practical reason for standard SQL to be inflexible in the grouping. If it wasn't, you could face awkward situations in which the value for a column will be unpredictable, and that is not a nice word, particularly if the query being run is showing you your bank account transactions.
Having said that, then why would MySQL allow queries that might not make sense? And even worse, the error in the query above could be just syntactically detected! The short answer is: performance. The long answer is that there are certain situations in which, based on data relations, getting an unpredictable value from the group will result in a predictable value.
If you haven't figured it out yet, the only way in which you can predict the value you'll get from taking an unpredictable element from a group will be if all the elements in the group are the same. A clear example of this situation is in the sample query in your very same question. Look at how owner_id
and owner_name
relates in the table. It is clear that given any owner_id
, e.g. 2
, you can only have one distinct owner_name
. Even having many rows, by choosing any, you will get Mike
as the result. In formal database jargon this can be explained as owner_id
functionally determines owner_name
.
Let's take a closer look at that fully working MySQL query:
SELECT owner_id, owner_name, COUNT(*) total FROM cars
GROUP BY owner_id
Given any owner_id
this would return the same owner_name
, so adding it to the GROUP BY
clause will not result in more rows returned. Even adding an aggregated function MAX(owner_name)
will not result in less rows returned. The resulting data will be exacly the same. In both cases, the query would be immediately turned into a legal standard SQL query as at least all the non-aggregated fields would be grouped by. So there are 3 approaches to get the same results.
However, as I mentioned before, this non-standard grouping has a performance advantage. You can check this so underrated link in which this is explained for more detail but I'm going to cite the most important part:
You can use this feature to get better performance by avoiding unnecessary column sorting and grouping. [...] The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate.
One thing that is worth mentioning is that the results are not necessarily wrong but rather indeterminate. In other words, getting the expected results does not mean you have written the right query. Writing the right query will always give you the expected results.
As you can see, it might be worth applying this MySQL extension to the GROUP BY
clause. Anyway, if this is not 100% clear yet then there is a rule of thumb that will make sure that your grouping will always be correct: Always group, at least, by all the non-aggregated fields in the select clause. You might be wasting a few CPU cycles in certain situations but it is better than returning indeterminate results. If you're still terrified about not grouping correctly then changing the ONLY_FULL_GROUP_BY
SQL mode could be a last resort :)
May your grouping be correct and performant... or at least correct.
Upvotes: 4