Reputation: 83
I'm a beginner at SQL and this is the question I have been asked to solve:
Say that a big city is defined as a
place
of typecity
with a population of at least 100,000. Write an SQL query that returns the scheme(state_name,no_big_city,big_city_population)
ordered bystate_name
, listing those states which have either (a) at least five big cities or (b) at least one million people living in big cities. The columnstate_name
is thename
of thestate
,no_big_city
is the number of big cities in the state, andbig_city_population
is the number of people living in big cities in the state.
Now, as far as I can see, the following query returns correct results:
SELECT state.name AS state_name
, COUNT(CASE WHEN place.type = 'city' AND place.population >= 100000 THEN 1 ELSE NULL END) AS no_big_city
, SUM(CASE WHEN place.type = 'city' AND place.population >= 100000 THEN place.population ELSE NULL END) AS big_city_population
FROM state
JOIN place
ON state.code = place.state_code
GROUP BY state_name
HAVING
COUNT(CASE WHEN place.type = 'city' AND place.population >= 100000 THEN 1 ELSE NULL END) >= 5 OR
SUM(CASE WHEN place.type = 'city' AND place.population >= 100000 THEN place.population ELSE NULL END) >= 1000000
ORDER BY state_name;
However, the two aggregate functions used in the code appear twice. MY question: is there any way of making this code duplication disappear preserving functionality?
To be clear, I have already tried using the alias, but I just get a "column does not exist" error.
Upvotes: 6
Views: 8187
Reputation: 656636
An output column's name can be used to refer to the column's value in
ORDER BY
andGROUP BY
clauses, but not in theWHERE
orHAVING
clauses; there you must write out the expression instead.
Bold emphasis mine.
You can avoid typing long expressions repeatedly with a subquery or CTE:
SELECT state_name, no_big_city, big_city_population
FROM (
SELECT s.name AS state_name
, COUNT(*) FILTER (WHERE p.type = 'city' AND p.population >= 100000) AS no_big_city
, SUM(population) FILTER (WHERE p.type = 'city' AND p.population >= 100000) AS big_city_population
FROM state s
JOIN place p ON s.code = p.state_code
GROUP BY s.name -- can be input column name as well, best schema-qualified to avoid ambiguity
) sub
WHERE no_big_city >= 5
OR big_city_population >= 1000000
ORDER BY state_name;
While being at it, I simplified with the aggregate FILTER
clause (Postgres 9.4+):
However, I suggest this simpler and faster query to begin with:
SELECT s.state_name, p.no_big_city, p.big_city_population
FROM state s
JOIN (
SELECT state_code AS code -- alias just to simplify join
, count(*) AS no_big_city
, sum(population) AS big_city_population
FROM place
WHERE type = 'city'
AND population >= 100000
GROUP BY 1 -- can be ordinal number referencing position in SELECT list
HAVING count(*) >= 5 OR sum(population) >= 1000000 -- simple expressions now
) p USING (code)
ORDER BY 1; -- can also be ordinal number
I am demonstrating another option to reference expressions in GROUP BY
and ORDER BY
. Only use that if it doesn't impair readability and maintainability.
Upvotes: 8
Reputation: 4120
SELECT clause is what you want to select from the filtred by WHERE clause table(s).
GROUP BY is a condition how to group filtered records to use in aggregation functions in the SELECT. So alias cannot be there.
But you can wrap your filtered records and select from them. Something like that:
SELECT state_name, no_big_city, big_city_population
FROM
(
SELECT
state.name AS state_name,
COUNT(1) no_big_city,
MAX(place.population) max_city_population,
SUM(place.population) AS big_city_population
FROM state JOIN place ON state.code = place.state_code
WHERE
place.type = 'city' AND
place.population >= 100000
GROUP BY state.name
)
WHERE
no_big_city >= 5 OR
max_city_population > 1000000
ORDER BY state_name
Also, moving conditions
place.type = 'city' AND
place.population >= 100000
out of CASE to WHERE will perform better. "No city" or "small city records will not be processed. especially if there is an index on place.type column.
Upvotes: 0
Reputation: 775
Not sure if this is a comment or an answer, since it is more preference based as opposed to technical, but I'll post it anyway
What I usually do when I need to reference calculated columns (usually a LOT at the same time) is I put my calculated columns within a derived table and then reference the calculated columns using its alias outside of the derived table. This syntax should be ANSI-SQL correct, but I am not familiar with PostGRES
select * from (
SELECT STATE.NAME AS state_name
,COUNT(CASE WHEN place.type = 'city'
AND place.population >= 100000 THEN 1 ELSE NULL END) AS no_big_city
,SUM(CASE WHEN place.type = 'city'
AND place.population >= 100000 THEN place.population ELSE NULL END) AS big_city_population
FROM STATE
INNER JOIN place
ON STATE.code = place.state_code
GROUP BY state_name
) sub
where no_big_city >= 5
and big_city_population >=100000
--HAVING COUNT(CASE WHEN place.type = 'city'
-- AND place.population >= 100000 THEN 1 ELSE NULL END) >= 5
-- OR SUM(CASE WHEN place.type = 'city'
-- AND place.population >= 100000 THEN place.population ELSE NULL END) >= 1000000
ORDER BY state_name;
The nice thing about this approach is, although you are adding complication via a subquery/derived table, the formula is kept in one place, so any changes only have to happen once. I do not know if this will perform worse than simply repeating the calcuation in the group-by, but I can't imagine it would be that much worse.
Upvotes: 1