Reputation: 3
This is a problem from sqlzoo.net
Here is the table world:
+-------------+-----------+---------+ | name | continent | area | +-------------+-----------+---------+ | Afghanistan | Asia | 652230 | | Albania | Europe | 2831741 | | Algeria | Africa | 28748 | | ... | ... | ... | +-------------+-----------+---------+
The Question:
Find the largest country (by area) in each continent, show the continent, the name and the area:
The answer I am trying to understand is:
SELECT continent, name, area
FROM world x
WHERE area >= ALL (SELECT area
FROM world y
WHERE y.continent=x.continent
AND area>0)
This code gives:
+-------------+-----------+--------+ | continent | name | area | +-------------+-----------+--------+ |Africa | Algeria | 2381741| |Oceania | Australia | 7692024| |South America| Brazil | 8515767| |North America| Canada | 9984670| |Asia | China | 9596961| |Caribbean | Cuba | 109884| |Europe | France | 640679| |Eurasia | Russia |17125242| +-------------+-----------+--------+
I do not understand how this works. I think that the inner select should yield a table that contains all the areas and the outer select only chooses the biggest (>=
). But how does it filter down to a list that seems to have been group by continent? How does y.continent=x.continent AND area>0
work?
Upvotes: 0
Views: 149
Reputation: 14832
To explain, I'll assume a world table with only 5 countries as follows:
Algeria 2381741
Australia 7692024
South Africa 1221037
New Zealand 268021
/*And to make it a little interesting:*/
Algeria Twin 2381741
The sub-query is matched to each row of the base query, 1-at-a-time. This is what's known as a correlated sub-query. Although correlated sub-queries work well, they are generally considered dangerous because of their tendency to yield poor performance characteristics if the optimiser cannot figure out a more efficient, equivalent structure.
The following table illustrates a logical view on how the data will be evaluated. Note that the database's query engine might be able to internally transform the plan into something mathematically equivalent, but far more efficient.
+-------------+--------------+--------+
| continent | name | area |
+-------------+--------------+--------+
|Africa | Algeria | 2381741| >= ALL( /*y.continent='Africa'*/
2381741, /*Alegria*/
1221037, /*South Africa*/
2381741) /*Alegria Twin*/
|Oceania | Australia | 7692024| >= ALL( /*y.continent='Oceania'*/
7692024, /*Australia*/
268021) /*New Zealand*/
|Africa | South Africa | 1221037| >= ALL( /*y.continent='Africa'*/
2381741, /*Alegria*/
1221037, /*South Africa*/
2381741) /*Alegria Twin*/
|Oceania | New Zealand | 268021| >= ALL( /*y.continent='Oceania'*/
7692024, /*Australia*/
268021) /*New Zealand*/
|Africa | Algeria Twin | 2381741| >= ALL( /*y.continent='Africa'*/
2381741, /*Alegria*/
1221037, /*South Africa*/
2381741) /*Alegria Twin*/
+-------------+--------------+--------+
From the above, rows 1, 2 and 5 are >=
all the sub-query areas. So these are kept, while the other rows are discarded.
Note that there are a few ways to write the sub-query that will produce the exact same results.
Being >=
all areas on a continent is the same as being =
the MAX area on the continent.
WHERE area = ( SELECT MAX(y.area)
FROM world y
WHERE y.continent=x.continent)
Another way to get the maximum is to get the first row when ordering by area DESC.
WHERE area = ( SELECT y.area
FROM world y
WHERE y.continent=x.continent
ORDER BY y.area DESC LIMIT 1)
However, be careful of the following which seems equivalent, but isn't.
/* The problem here is that only 1 Algeria will happen to be
first in the sub-query. Meaning 1 row will be missing from
the final result set. */
WHERE name = ( SELECT y.name
FROM world y
WHERE y.continent=x.continent
ORDER BY y.area DESC LIMIT 1)
Finally, I mentioned that correlated sub-queries may have performance concerns. So it's generally advisable to consider rewriting a correlated sub-query into one that joins directly to the sub-query in the FROM clause if you can. E.g.
SELECT x.contient, x.name, x.area
FROM world x
INNER JOIN (
SELECT MAX(y.area) as max_area, y.continent
FROM world y
GROUP BY y.continent
) z ON
x.continent = z.continent
AND x.area = z.max_area
Upvotes: 1
Reputation: 881113
SELECT THIS, name, area
FROM world X
WHERE area >= ALL (SELECT area
FROM world y
WHERE y.continent = X.THIS
AND area > 0)
Note the capitalised X
and THIS
, they are the items that tie together the sub-query with the query.
In terms of functional effect(a), the sub-query only returns rows related to the current row being processed at the query level.
So think of it this way. While processing the continent Africa
, the sub-query is basically:
SELECT area
FROM world y
WHERE y.continent = 'Africa'
AND area > 0
and, because you have WHERE area >= ALL [[that_sub_query]]
in the outer query, it will only give you rows where the area is at least as large as the largest area for that continent.
Or, more succinctly, equal to the largest since, in a given group, no one item can be simultaneously smaller than another and larger or equal to them all.
(a) How it works under the covers may be vastly different but the effect is what we're concerned with here.
Upvotes: 1