Reputation: 31221

Prevent full table scan for query with multiple where clauses

A while ago I posted a message about optimizing a query in MySQL. I have since ported the data and query to PostgreSQL, but now PostgreSQL has the same problem. The solution in MySQL was to force the optimizer to not optimize using STRAIGHT_JOIN. PostgreSQL offers no such option.

Update Revised

I have isolated the part of the query that fixes the problem (d.month_ref_id = 1):

select
  d.*
from
  daily d
join month_ref m on m.id = d.month_ref_id 
join year_ref y on y.id = m.year_ref_id
where
  m.category_id = '001' and
  d.month_ref_id = 1

However, I can't hard-code a month reference to 1. The query that produces a full table scan is:

select
  d.*
from
  daily d
join month_ref m on m.id = d.month_ref_id 
join year_ref y on y.id = m.year_ref_id
where
  m.category_id = '001'

The index on daily.month_ref_id is:

CREATE INDEX daily_month_ref_idx
  ON climate.daily
  USING btree
  (month_ref_id);

Why is the query performing a full table scan and what can be done to avoid it?

Thank you!

Upvotes: 1

Answers (3)

Tom Drake

Reputation: 537

I think the FTS is occurring because of the way you have parameterized your query / joins. By this, I mean that you have two parameters, one is being compared against a column in the 'daily' table, and another is being compared against a column in the 'month-ref' table. However, both of these values can be used to filter rows in a single table 'month-ref'. Make that table the primary table in your query and re-write your query as follows:

select
  d.*
from month_ref m
join daily d on d.month_ref_id = m.id
join year_ref y on y.id = m.year_ref_id
where
  m.category_id = '001' and
  m.id = 1

This way, the database can easily locate all the necessary rows in the month-ref table based solely on the input parameter values, and rows in the daily table can be easily located over the specified join using the index you described. Depending on the number of rows might be found in the month-ref table, and whether either of the columns referenced above contain distinct values, you may need to create an index on the month-ref table.

Upvotes: 0

Thomas

Reputation: 64674

Even though it might not make much a performance difference, I would use Join clauses for join the tables instead of cross joins and the Where clause.
You are calling a function in the Where clause which will cause the system to do a table scan. It will not matter what database you use, this is going to be true.
Why the Left Join on City? Do you know for a fact that the given Id will exist (in this case 10663? If so, you should use an inner join.
You might be able to given the compiler hints about how to formulate the query using parenthesis (I'm not sure if Postgres will honor them).

Select  avg(d.amount) AS amount,  y.year
From (station s
        Left Join city c -- You want to cross join on city? Why not use an Inner join?
            On c.id = 10663
                And 6371.009 
                  * SQRT( 
                        POW(RADIANS(c.latitude_decimal - s.latitude_decimal), 2) 
                        + (
                            COS(RADIANS(c.latitude_decimal + s.latitude_decimal) / 2) 
                            * POW(RADIANS(c.longitude_decimal - s.longitude_decimal), 2)
                            )
                        ) <= 50)
    Join station_district sd
        On sd.Id = s.station_district_id
    Join year_ref y
        On y.station_district_id = sd.id
    Join month_ref m
        On m.year_ref_id = y.id
    Join daily d
        On d.month_ref_id = m.id
Where s.elevation Between 0 And 2000 
    And y.year Between 1980 And 2000
    And m.month = 12
    And m.category_id = '001'
    And d.daily_flag_id <> 'M'
Group By y.year

Since you are not using the station, station_district nor city table in the results, you might be able to move those to an exists statement:

Select  avg(d.amount) AS amount,  y.year
From year_ref y
    Join month_ref m
        On m.year_ref_id = y.id
    Join daily d
        On d.month_ref_id = m.id
Where y.year Between 1980 And 2000
    And m.month = 12
    And m.category_id = '001'
    And d.daily_flag_id <> 'M'
    And Exist   (
                Select 1
                From station s1
                    Join city c1
                        On c1.id = 10663
                Where 6371.009 
                      * SQRT( 
                            POW(RADIANS(c1.latitude_decimal - s1.latitude_decimal), 2) 
                            + (
                                COS(RADIANS(c1.latitude_decimal + s1.latitude_decimal) / 2) 
                                * POW(RADIANS(c1.longitude_decimal - s1.longitude_decimal), 2)
                                )
                            ) <= 50
                    And S1.station_district_id = y.station_district_id
                )
Group By y.year

Upvotes: 3

mdma

Reputation: 57787

I don't know what other variations of the query you've tried, but the JOIN on City seems a little strange - have you tried replacing that with a WHERE clause? Also, the relationships between the various tables are currently in the WHERE clause - these are probably best implemented as an INNER JOIN.

Disclaimer: I don't know PostreSQL specifically.

EDIT: Here's a link that describes changing WHERE clauses to JOINs to influence join order, and discusses the join_collapse_limit to force the optimizer to use your specified join order. http://www.postgresql.org/docs/8.2/static/explicit-joins.html

EDIT2: Another alternative is to nest SELECT statements, which may also force the optimizer to construct the query in the (reverse) nesting order you specify.

Upvotes: 1

Prevent full table scan for query with multiple where clauses

Answers (3)

Related Questions