user2838966
user2838966

Reputation: 683

Need help optimizing outer join SQL query

I am hoping to get some advice on how to optimize the performance of this query I have with an outer join. First I will explain what I am trying to do and then I'll show the code and results.

I have an Accounts table that has a list of all customer accounts. And I have a datausage table which keeps track of how much data each customer is using. A backend process running on multiple servers inserts records into the datausage table each day to keep track of how much usage occurred that day for each customer on that server.

The backend process works like this - if there is no activity on that server for an account on that day, no records are written for that account. If there is activity, one record is written with a "LogDate" of that day. This is happening on multiple servers. So collectively the datausage table winds up with no rows (no activity at all for that customer each day), one row (activity was only on one server for that day), or multiple rows (activity was on multiple servers for that day).

We need to run a report that lists ALL customers, along with their usage for a specific date range. Some customers may have no usage at all (nothing whatsoever in the datausage table). Some customers may have no usage at all for the current period (but usage in other periods).

Regardless of whether there is any usage or not (ever, or for the selected period) we need EVERY customer in the Accounts table to be listed in the report, even if they show no usage. Therefore it seems this required an outer join.

Here is the query I am using:

SELECT
   Accounts.accountID as AccountID,
   IFNULL(Accounts.name,Accounts.accountID) as AccountName,
   AccountPlans.plantype as AccountType,
   Accounts.status as AccountStatus,
   date(Accounts.created_at) as Created,
   sum(IFNULL(datausage.Core,0) + (IFNULL(datausage.CoreDeluxe,0) * 3)) as 'CoreData'
FROM `Accounts` 
 LEFT JOIN `datausage` on `Accounts`.`accountID` = `datausage`.`accountID`
 LEFT JOIN `AccountPlans` on `AccountPlans`.`PlanID` = `Accounts`.`PlanID`
WHERE
(
   (`datausage`.`LogDate` >= '2014-06-01' and `datausage`.`LogDate` < '2014-07-01') 
   or `datausage`.`LogDate` is null
) 
GROUP BY Accounts.accountID 
ORDER BY `AccountName` asc 

This query takes about 2 seconds to run. However it only takes 0.3 seconds to run if the "or datausage.LogDate is NULL" is removed. However, it seems I must have that clause in there, because accounts with no usage are excluded from the result set if that does not appear.

Here is the table data:

| id | select_type | table        | type   | possible_keys                                           | key     | key_len | ref                  | rows  | Extra                                                  |
+----+-------------+--------------+--------+---------------------------------------------------------+---------+---------+----------------------+-------    +----------------------------------------------------+
|  1 | SIMPLE      | Accounts     | ALL    | PRIMARY,accounts_planid_foreign,accounts_cardid_foreign | NULL    | NULL    | NULL                 |    57 | Using     temporary; Using filesort                    |
|  1 | SIMPLE      | datausage   | ALL    | NULL                                                    | NULL    | NULL    | NULL                 | 96805 | Using where;     Using join buffer (Block Nested Loop) |
|  1 | SIMPLE      | AccountPlans | eq_ref | PRIMARY                                                 | PRIMARY | 4       | mydb.Accounts.planID |     1 | NULL                                                   |

The indexes on Accounts table are as follows:

| Table    | Non_unique | Key_name                | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+----------+------------+-------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Accounts |          0 | PRIMARY                 |            1 | accountID   | A         |          57 |     NULL | NULL   |      | BTREE      |         |               |
| Accounts |          1 | accounts_planid_foreign |            1 | planID      | A         |           5 |     NULL | NULL   |      | BTREE      |         |               |
| Accounts |          1 | accounts_cardid_foreign |            1 | cardID      | A         |           0 |     NULL | NULL   | YES  | BTREE      |         |               |

The index on the datausage table is as follows:

| Table      | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| datausage |          0 | PRIMARY  |            1 | UsageID     | A         |       96805 |     NULL | NULL   |      | BTREE      |         |               |

I tried creating different indexes on datausage to see if it would help, but nothing did. I tried an index on AccountID, an index on AccountID, LogData, and index on LogData, AccountID, and an index on LogData. None of these made any difference.

I also tried using a UNION ALL with one of the queries with the logdata range and the other query just where logdata is null, but the result was about the same (actually a bit worse).

Can someone please help me understand what may be going on and the ways in which I can optimize the query execution time? Thank you!!

UPDATE: At Philipxy's request, here are the table definitions. Note that I removed some columns and constraints that are not related to this query to help keep things as tight and clean as possible.

CREATE TABLE `Accounts` (
   `accountID` varchar(25) NOT NULL,
   `name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
   `status` int(11) NOT NULL,
   `planID` int(10) unsigned NOT NULL DEFAULT '1',
   `created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00'
   PRIMARY KEY (`accountID`),
   KEY `accounts_planid_foreign` (`planID`),
   KEY `acctname_id_ndx` (`name`,`accountID`),
   CONSTRAINT `accounts_planid_foreign` FOREIGN KEY (`planID`) REFERENCES `AccountPlans` (`planID`)
   ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci 


CREATE TABLE `datausage` (
   `UsageID` int(11) NOT NULL AUTO_INCREMENT,
   `Core` int(11) DEFAULT NULL,
   `CoreDelux` int(11) DEFAULT NULL,
   `AccountID` varchar(25) DEFAULT NULL,
   `LogDate` date DEFAULT NULL
   PRIMARY KEY (`UsageID`),
   KEY `acctusage` (`AccountID`,`LogDate`)
   ) ENGINE=MyISAM AUTO_INCREMENT=104303 DEFAULT CHARSET=latin1 


CREATE TABLE `AccountPlans` (
   `planID` int(10) unsigned NOT NULL AUTO_INCREMENT,
   `name` varchar(150) COLLATE utf8_unicode_ci NOT NULL,
   `params` text COLLATE utf8_unicode_ci NOT NULL,
   `created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
   `plantype` varchar(25) COLLATE utf8_unicode_ci NOT NULL,
   PRIMARY KEY (`planID`),
   KEY `acctplans_id_type_ndx` (`planID`,`plantype`)
 ) ENGINE=InnoDB AUTO_INCREMENT=10 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci 

Upvotes: 1

Views: 174

Answers (2)

philipxy
philipxy

Reputation: 15118

When you left join with datausage you should restrict the output as much as possible right there. (JOIN means AND means WHERE means ON. Put the conditions in essentially whatever order will be clear and/or optimize when necessary.) The result will be a null-extended row when there was no usage; you want to leave that row in.

When you join with AccountPlans you don't want to introduce null rows (which can't happen anyway) so that's just an inner join.

The version below has the AccountPlan join as an inner join and put first. (Indexed) Accounts FK PlanID to AccountPlan means the DBMS knows the inner join will only ever generate one row per Accounts PK. So the output has key AccountId. That row can be immediately inner joined to datausage. (An index on its AccountID should help, eg for a merge join.) For the other way around there is no PlanID key/index on the outer join result to join with AccountPlan.

SELECT
   a.accountID as AccountID,
   IFNULL(a.name,a.accountID) as AccountName,
   ap.plantype as AccountType,
   a.status as AccountStatus,
   date(a.created_at) as Created,
   sum(IFNULL(du.Core,0) + (IFNULL(du.CoreDeluxe,0) * 3)) as CoreData
FROM Accounts a
 JOIN AccountPlans ap ON ap.PlanID = a.PlanID
 LEFT JOIN datausage du ON a.accountID = du.accountID AND du.LogDate >= '2014-06-01' AND du.LogDate < '2014-07-01'
GROUP BY a.accountID

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1269763

First, you can simplify the query by moving the where clause to the on clause:

SELECT a.accountID as AccountID, coalesce(a.name, a.accountID) as AccountName,
       ap.plantype as AccountType, a.status as AccountStatus,
       date(a.created_at) as Created,
       sum(coalesce(du.Core, 0) + (coalesce(du.CoreDeluxe, 0) * 3)) as CoreData
FROM Accounts a LEFT JOIN 
     datausage du
     on a.accountID = du.`accountID` AND
        du.`LogDate` >= '2014-06-01' and du.`LogDate` < '2014-07-01'
LEFT JOIN 
     AccountPlans ap
     on ap.`PlanID` = a.`PlanID`
GROUP BY a.accountID 
ORDER BY AccountName asc ;

(I also introduced table aliases to make the query easier to read.)

This version should make better uses of indexes because it eliminates the or in the where clause. However, it still won't use an index for the outer sort. The following might be better:

SELECT a.accountID as AccountID, coalesce(a.name, a.accountID) as AccountName,
       ap.plantype as AccountType, a.status as AccountStatus,
       date(a.created_at) as Created,
       sum(coalesce(du.Core, 0) + (coalesce(du.CoreDeluxe, 0) * 3)) as CoreData
FROM Accounts a LEFT JOIN 
     datausage du
     on a.accountID = du.`accountID` AND
        du.LogDate >= '2014-06-01' and du.LogDate < '2014-07-01'LEFT JOIN 
     AccountPlans ap
     on ap.PlanID = a.PlanID
GROUP BY a.accountID 
ORDER BY a.name, a.accountID ;

For this, I would recommend the following indexes:

Accounts(name, AccountId)
Datausage(AccountId, LogDate)
AccountPlans(PlanId, PlanType)

Upvotes: 1

Related Questions