Spawn
Spawn

Reputation: 19

Databases Relational algebra optimization

I'm trying to do some query optimization; taking an SQL query into relational algebra and optimizing it.

My db tables schemas are as follow:

Hills(MId, Mname, Long, Lat, Height, Rating,... )
Runners(HId, HName, Age, Skill,... )
Runs(MId, CId, Date, Duration)

Where there may be many columns in Runners and Hills.

My SQL query is:

SELECT DISTINCT Runners.HName, Runners.Age
FROM Hills, Runners, Runs
WHERE Runners.HId = Runs.HId AND Runs.MID = Hills.MId AND Height > 1200

So i could start by doing:

π Name, Age(σ Height > 1200 (Hills × Runners × Runs))

Or something like this and then optimizing it with a good choice of joins, but i'm not sure where to start

Upvotes: 1

Views: 2278

Answers (1)

Jonathan Leffler
Jonathan Leffler

Reputation: 754110

You could start by using the SQL join notation:

SELECT DISTINCT P.HName, P.Age
  FROM Hills   AS H
  JOIN Runs    AS R ON H.MId = R.MId
  JOIN Runners AS P ON P.HId = R.HId
 WHERE H.Height > 1200

You can then observe that the WHERE condition applies only to Hills, so you could push down the search criterion:

SELECT DISTINCT P.HName, P.Age
  FROM (SELECT MId FROM Hills WHERE Height > 1200) AS H
  JOIN Runs    AS R ON H.MId = R.MId
  JOIN Runners AS P ON P.HId = R.HId

This is a standard optimization - and one which the SQL optimizer will do automatically. In fact, it probably isn't worth doing much rewriting of the first query shown because the optimizer can deal with it. The other optimization I see as possible is pushing the DISTINCT operation down a level:

SELECT P.HName, P.Age
  FROM (SELECT DISTINCT R.HId
          FROM (SELECT MId FROM Hills WHERE Height > 1200) AS H
          JOIN Runs AS R ON H.MId = R.MId
       ) AS R1
  JOIN Runners AS P ON P.HId = R1.HId

This keeps the intermediate result set as small as possible: R1 contains a list of ID-values for the people who have run at least one 1200 metre (or is that 1200 feet?) hill, and can be joined 1:1 with the details in the Runners table. It would be interesting to see whether an optimizer is able to deduce the push-down of the DISTINCT for itself.

Of course, in relational algebra, the DISTINCT operation is done 'automatically' - every result and intermediate result is always a relation with no duplicates.


Given the original 'relational algebra' notation:

  • π Name, Age(σ Height > 1200 (Hills × Runners × Runs))

This corresponds to the first SQL statement above.

The second SQL statement then corresponds (more or less) to:

  • π Name, Age ((π MId (σ Height > 1200 (Hills))) × Runners × Runs)

The third SQL statement then corresponds (more or less) to:

  • π Name, Age ((π HId ((π MId (σ Height > 1200 (Hills))) × Runs)) × Runners)

Where I'm assuming that parentheses force the relational algebra to evaluate expressions in order. I'm not sure that I've got the minimum possible number of parentheses in there, but the ones that are there don't leave much wriggle room for ambiguity.

Upvotes: 3

Related Questions