Reputation: 22500
I am trying to understand the semantics of SQL GROUP BY
and HAVING
clauses by mimicking it in Haskell with a minimal mock-up.
For example, the following SQL taken from the Postgres tutorial
SELECT city, max(temp_lo)
FROM weather
GROUP BY city
HAVING max(temp_lo) < 40;
return the cities that have all lowest temperature values below 40.
My understanding is that HAVING
depends on GROUP BY
, so HAVING
can be implemented on top of GROUP BY
. To keep things simple, I use ordinal numbers to represent columns/fields. So here is what I have:
import Data.List
import Data.Function
at = flip (!!)
groupBySql n = groupBy ((==) `on` (at n)). sortBy (comparing (at n) )
havingGroupBy f n tab = [ filter f | g <- groupBySql 2 tab ]
And the test data:
--mimick weather table: temp_lo, temp_hi, city
weather = [ [10, 30, 1],
[45, 99, 2],
[0, 3, 3],
[10, 35, 1],
[55, 103, 2],
[5, 29, 3]
]
test1 = havingGroupBy ( (<40).(at 0) ) 2 weather
But I have an error in writing havingGroupBy
, and run out of time/energy.
*Main> test1
<interactive>:53:1:
No instance for (Show ([[b0]] -> [[b0]]))
(maybe you haven't applied enough arguments to a function?)
A bunch of questions here:
Obviously, I didn't get the types right while mimicking the HAVING
, how to fix it.
Is there a way to add type notation just for a parameter like f
in havingGroupBy f n tab = [ filter f | g <- groupBySql 2 tab ]
, for debugging? My intended type for f
is a boolean function on a row: [a]->Bool
.
Is my understanding of GROUP BY
and HAVING
correct so far?
Upvotes: 1
Views: 438
Reputation: 105885
Note: This post is written in literate Haskell. You can save it as Main.lhs and try it in your GHCi.
To model this in Haskell, we should use the appropriate types:
> import Data.Function (on)
> import Data.List (maximumBy, sortBy, groupBy)
> import Data.Ord (comparing)
>
> type WEntry = (Int, Int, String)
> temp_lo (t, _, _) = t
> city (_, _, c) = c
Now, a select in SQL can be thought as a function from one (or many) table(s) to a table of results:
> type Result = (Int, String)
> select :: [WEntry] -> [Result]
> select = map (\(l,h,c) -> (l, c))
Note that this does not account aggregation. Now, we group the entries by their cities:
> groupByCity :: [WEntry] -> [[WEntry]]
> groupByCity = groupBy ((==) `on` city) . sortBy (comparing city)
Since both HAVING
and the original SELECT
use max(temp_lo)
, we aggregate the maximum lowest temperature here for some optimization:
> aggregate :: [[WEntry]] -> [WEntry]
> aggregate = map (maximumBy (comparing temp_lo))
You can think of groupByCity
and GROUP BY
as creating temporary sub tables. If we wouldn't use aggregates those tables would be concatenated in the end.
Now we need to filter those with a valid temperature:
> having :: [WEntry] -> [WEntry]
> having = filter ((<40) . temp_lo)
And then put all things together:
> query :: [WEntry] -> [Result]
> query = select . having . aggregate . groupByCity
An alternative would have been
havingCity :: [[WEntry]] -> [[WEntry]]
havingCity = filter (not . any ((40<=) . temp_lo))
query = select . aggregate . havingCity . groupByCity
The important part is that an aggregate usually folds intermediate results. If you don't use grouping, max(temp_lo)
will return a single value (the maximum lowest temperature of the whole table). If you group, max(temp_lo)
will be maximum lowest temperature of a specific group. Likewise HAVING
makes it possible to check the result an aggregate would return.
Last but not least, here's a main so that you can actually try it:
> weather :: [WEntry]
> weather =
> [ (10, 30, "Berlin")
> , (45, 99, "San Francisco")
> , ( 0, 3, "Springfield")
> , (14, 35, "Berlin")
> , (55, 103, "San Francisco")
> , ( 5, 29, "Springfield")
> ]
>
> main = print $ query weather
> -- Result: [(14,"Berlin"),(5,"Springfield")]
Upvotes: 1