Reputation: 19
I'm trying to do some query optimization; taking an SQL query into relational algebra and optimizing it.
My db tables schemas are as follow:
Hills(MId, Mname, Long, Lat, Height, Rating,... )
Runners(HId, HName, Age, Skill,... )
Runs(MId, CId, Date, Duration)
Where there may be many columns in Runners and Hills.
My SQL query is:
SELECT DISTINCT Runners.HName, Runners.Age
FROM Hills, Runners, Runs
WHERE Runners.HId = Runs.HId AND Runs.MID = Hills.MId AND Height > 1200
So i could start by doing:
π Name, Age(σ Height > 1200 (Hills × Runners × Runs))
Or something like this and then optimizing it with a good choice of joins, but i'm not sure where to start
Upvotes: 1
Views: 2278
Reputation: 754110
You could start by using the SQL join notation:
SELECT DISTINCT P.HName, P.Age
FROM Hills AS H
JOIN Runs AS R ON H.MId = R.MId
JOIN Runners AS P ON P.HId = R.HId
WHERE H.Height > 1200
You can then observe that the WHERE condition applies only to Hills, so you could push down the search criterion:
SELECT DISTINCT P.HName, P.Age
FROM (SELECT MId FROM Hills WHERE Height > 1200) AS H
JOIN Runs AS R ON H.MId = R.MId
JOIN Runners AS P ON P.HId = R.HId
This is a standard optimization - and one which the SQL optimizer will do automatically. In fact, it probably isn't worth doing much rewriting of the first query shown because the optimizer can deal with it. The other optimization I see as possible is pushing the DISTINCT operation down a level:
SELECT P.HName, P.Age
FROM (SELECT DISTINCT R.HId
FROM (SELECT MId FROM Hills WHERE Height > 1200) AS H
JOIN Runs AS R ON H.MId = R.MId
) AS R1
JOIN Runners AS P ON P.HId = R1.HId
This keeps the intermediate result set as small as possible: R1 contains a list of ID-values for the people who have run at least one 1200 metre (or is that 1200 feet?) hill, and can be joined 1:1 with the details in the Runners table. It would be interesting to see whether an optimizer is able to deduce the push-down of the DISTINCT for itself.
Of course, in relational algebra, the DISTINCT operation is done 'automatically' - every result and intermediate result is always a relation with no duplicates.
Given the original 'relational algebra' notation:
This corresponds to the first SQL statement above.
The second SQL statement then corresponds (more or less) to:
The third SQL statement then corresponds (more or less) to:
Where I'm assuming that parentheses force the relational algebra to evaluate expressions in order. I'm not sure that I've got the minimum possible number of parentheses in there, but the ones that are there don't leave much wriggle room for ambiguity.
Upvotes: 3