kevinn2065
kevinn2065

Reputation: 375

Mysql Group Join Optimization Issue

I'm trying to optimize this query it returns multiple rows from the building_rent_prices and the building_weather and then groups them and calculates the average of their field. So far the tables are all under a million rows yet it takes several seconds, does anyone know how i could optimize this from composite indexes or rewriting the query? I'm assuming it should be able to be a 100ms or quicker query but so far it seems like it cant

SELECT b.*
     , AVG(r.rent)
     , AVG(w.high_temp)
  FROM buildings b
  LEFT 
  JOIN building_rent_prices r
    ON r.building_id = b.building_id 
  LEFT 
  JOIN building_weather w
    ON w.building_id = b.building_id 
 WHERE w.date BETWEEN CURDATE() AND CURDATE + INTERVAL 4 DAY
   AND r.date BETWEEN CURDATE() AND CURDATE + INTERVAL 10 day
 GROUP  
    BY b.building_id
 ORDER  
    BY AVG(r.rent) / b.square_feet DESC
 LIMIT 10;  

Explain said the following:

1 SIMPLE building_rent_prices range

1 SIMPLE buildings eq_ref

1 SIMPLE building_weather ref

Using where; Using index; Using temporary; Using filesort

Using where

Using where; Using index

Im working on some test data heres the create table

CREATE TABLE building(
building_id INT PRIMARY KEY AUTO_INCREMENT, 
name VARCHAR(255),
square_feet INT
);

CREATE TABLE building_weather(
building_weather_id INT PRIMARY KEY AUTO_INCREMENT, 
building_id INT,
weather_date DATE,
high_temp INT
);

CREATE TABLE building_rates(
building_rate_id INT PRIMARY KEY AUTO_INCREMENT, 
building_id INT,
weather_date DATE,
rate double
);

ALTER TABLE building_rates INDEX(building_id);
ALTER TABLE buildings INDEX(building_id);
ALTER TABLE building_weather INDEX(building_id);

This seems to working in under 1 second based on DRapp's answer without indexes(I still need to test that its valid)

select 
  B.*, 
  BRP.avgRent, 
  BW.avgTemp
   from 
   ( select building_id,
            AVG( rent ) avgRent
         from
            building_rent_prices
         where
            date BETWEEN CURDATE() AND CURDATE() + 10
         group by
            building_id
         order by
            building_id ) BRP
     JOIN buildings B
        on BRP.building_id = B.building_id
     left join ( select building_id,
                        AVG( hi_temp ) avgTemp
                     from building_weather 
                     where date BETWEEN CURDATE() AND CURDATE() + 10
                     group by building_id) BW
        on BRP.building_id =  BW.building_id
   GROUP BY BRP.building_id
 ORDER BY BRP.avgRent / 1 DESC
   LIMIT 10;

Upvotes: 0

Views: 53

Answers (4)

kevinn2065
kevinn2065

Reputation: 375

For anyone who has issues similar to mine the solution is to GROUP each table you would like to join using building_id that way you are joining one to one with every average. Ollie Jones query with JOIN rather than LEFT JOIN is the closest answer if you do not want results that don't have data in all tables. Also The main issue I had was that I forgot to place an index on a avg(low_temp) column so the INDEXES. What I learned from this is that if you do an aggregated function in your select it belongs in your indexes. I added low_temp to it.

building_weather (date, building_id, hi_temp, low_temp) AS suggested by Ollie and DR APP

ALTER TABLE building_weather ADD index(date, building_id, hi_temp, low_temp);

SELECT buildings.*, a.rent, b.high_temp, b.low_temp
  FROM buildings
  JOIN (
      SELECT building_id, AVG(rent) rent
        FROM building_rent_prices
       WHERE date BETWEEN CURDATE() AND CURDATE() + INTERVAL 10 DAY
       GROUP BY building_id
   ) AS a ON buildings.building_id = a.building_id
  JOIN (
      SELECT building_id, AVG(high_temp) high_temp, AVG(low_temp) low_temp
        FROM building_weather
       WHERE date BETWEEN CURDATE() AND CURDATE() + INTERVAL 4 DAY
       GROUP BY building_id
   ) AS b ON buildings.building_id = b.building_id
 ORDER BY a.rent / buildings.square_feet DESC
 LIMIT 10

Upvotes: 0

DRapp
DRapp

Reputation: 48139

First, your query to the WEATHER based table is only for 4 days, the RENT PRICES table is for 10 days. Since you don't have any join correlation between the two, you will result in a Cartesian result of 40 records per one building ID. Was that intentional or just not identified as an oops...

Second, I would adjust the query as I have below, but also, I have adjusted BOTH WEATHER and RENT PRICES tables to reflect the same date range period. I start with an sub query of just the prices and group by building and date, then join to buildings, then another sub query to weather grouped by building and date. But here, I join from the rent prices sub query to the weather sub query on both building ID AND date so it will at most retain a 1:1 ratio. I don't know why weather is even a consideration spanning date ranges.

However to help with indexes, I would suggest the following

Table                Index on
buildings            (Building_ID)  <-- probably already exists as a PK
building_rent_prices (date, building_id, rent)
building_weather     (date, building_id, hi_temp)

The purpose of the index is to take advantage of the WHERE clause (date first), THEN the GROUP BY ( building ID), and is a COVERING INDEX (includes the rent). Similarly for the building weather table for same reasons.

select 
      B.*, 
      BRP.avgRent, 
      BW.avgTemp
   from 
       ( select building_id,
                AVG( rent ) avgRent
             from
                building_rent_prices
             where
                date BETWEEN CURDATE() AND CURDATE() + INTERVAL 10 DAY
             group by
                building_id
             order by
                building_id ) BRP

         JOIN buildings B
            on BRP.building_id = B.building_id

         left join ( select building_id,
                            AVG( hi_temp ) avgTemp
                         from
                            building_weather 
                         where
                            date BETWEEN CURDATE() AND CURDATE() + INTERVAL 10 DAY
                         group by
                            building_id ) BW
            on BRP.building_id =  BW.building_id

   GROUP BY 
      BRP.building_id

   ORDER BY 
      BRP.avgRent / B.square_feet DESC

   LIMIT 10;

CLARIFICATION...

I cant guarantee the execution order, but in essence, the two ( queries ) for BPR and BW aliases, they would be done and executed quickly before any join took place. If you wanted the average across the (in my example) 10 days vs a per-day join, then I have removed the "date" as a component of the group, so each will return respectively at most, 1 per building.

Now, joining to the building table on just the 1:1:1 ratio will limit the records in the final result set. This should take care of your concern of the average over those days in question.

Upvotes: 1

O. Jones
O. Jones

Reputation: 108651

Let's take a look at this query in detail. You want to report two different kinds of averages for each building. You need to compute those in separate subqueries. If you don't you'll get the cartesian combinatorial explosion.

One is an average of eleven days' worth of rent prices. You get that data with this subquery:

          SELECT building_id, AVG(rent) rent
            FROM building_rent_prices
           WHERE date BETWEEN CURDATE() AND CURDATE() + INTERVAL 10 DAY
           GROUP BY building_id

This subquery can be optimized by a compound covering index on building_rent_prices, consisting of (date, building_id, rent).

The next is an average of five days' worth of temperature.

          SELECT building_id, AVG(high_temp) high_temp
            FROM building_weather
           WHERE date BETWEEN CURDATE() AND CURDATE() + INTERVAL 4 DAY
           GROUP BY building_id

This can be optimized by a compound covering index on building_weather, consisting of (date, building_id, high_temp).

Finally, you need to join these two subqueries to your buildings table to generate the final result set.

SELECT buildings.*, a.rent, b.high_temp
  FROM buildings
  LEFT JOIN (
          SELECT building_id, AVG(rent) rent
            FROM building_rent_prices
           WHERE date BETWEEN CURDATE() AND CURDATE() + INTERVAL 10 DAY
           GROUP BY building_id
       ) AS a ON buildings.building_id = a.building_id
  LEFT JOIN (
          SELECT building_id, AVG(high_temp) high_temp
            FROM building_weather
           WHERE date BETWEEN CURDATE() AND CURDATE() + INTERVAL 4 DAY
           GROUP BY building_id
       ) AS b ON buildings.building_id = b.building_id
 ORDER BY a.rent / buildings.square_feet DESC
 LIMIT 10

Once the two subqueries are optimized, this one doesn't need anything except the building_id primary key.

In summary, to speed up this query, create the two compound indexes mentioned on the building_rent_prices and building_weather queries.

Upvotes: 1

Rick James
Rick James

Reputation: 142298

Don't use CURDATE + 4:

mysql> select CURDATE(), CURDATE() + 30, CURDATE() + INTERVAL 30 DAY;
+------------+----------------+-----------------------------+
| CURDATE()  | CURDATE() + 30 | CURDATE() + INTERVAL 30 DAY |
+------------+----------------+-----------------------------+
| 2015-03-15 |       20150345 | 2015-04-14                  |
+------------+----------------+-----------------------------+

Add INDEX(building_id) to the second and third tables.

If those don't fix it; come back with a revised query and schema, and I will look deeper.

Upvotes: 1

Related Questions