Vpp Man
Vpp Man

Reputation: 2546

Performance issue on query with math calculations

This my query with its performance (slow_query_log):

SELECT j.`offer_id`, o.`offer_name`, j.`success_rate`
FROM
(
    SELECT 
        t.`offer_id`,
        (
            SUM(CASE WHEN `offer_id` = t.`offer_id` AND `sales_status` = 'SUCCESS' THEN 1 ELSE 0 END) / COUNT(*)
        ) AS `success_rate`
    FROM `tblSales` AS t
    WHERE   DATE(t.`sales_time`) = CURDATE()  
    GROUP BY t.`offer_id`               
    ORDER BY `success_rate` DESC
) AS j
LEFT JOIN `tblOffers` AS o
    ON j.`offer_id` = o.`offer_id`
LIMIT 5;

# Time: 180113 18:51:19
# User@Host: root[root] @ localhost [127.0.0.1]  Id:    71
# Query_time: 10.472599  Lock_time: 0.001000 Rows_sent: 0  Rows_examined: 1156134

Here, tblOffers have all the OFFERS listed. And the tblSales contains all the sales. What am trying to find out is the top selling offers, based on the success rate (ie. those sales which are SUCCESS).

The query works fine and provides the output I needed. But it appears to be that its a bit slower.

offer_id and sales_status are already indexed in the tblSales. So do you have any suggestion on improving the inner query (where it calculates the success rate) so that performance can be improved? I have been playing with the math for more than 2hrs. But couldn't get a better way.

Btw, tblSales has lots of data. It contains those sales which are SUCCESSFUL, FAILED, PENDING, etc.

Thank you


EDIT

As you requested am including the table design also(only relevant fields are included):

tblSales
`sales_id`          bigint UNSIGNED NOT NULL AUTO_INCREMENT,
`offer_id`          bigint UNSIGNED NOT NULL DEFAULT '0',   
`sales_time`        DATETIME NOT NULL DEFAULT  '0000-00-00 00:00:00',   
`sales_status`      ENUM('WAITING', 'SUCCESS', 'FAILED', 'CANCELLED') NOT NULL DEFAULT 'WAITING',
PRIMARY KEY (`sales_id`),
KEY (`offer_id`),
KEY (`sales_status`)

There are some other fields also in this table, that holds some other info. Amount, user_id, etc. which are not relevant for my question.

Upvotes: 0

Views: 615

Answers (4)

Rick James
Rick James

Reputation: 142278

Numerous 'problems', none of which involve "math".

JOINs make things difficult. LEFT JOIN says "I don't care whether the row exists in the 'right' table. (I suspect you don't need LEFT??) But it also says "There may be multiple rows in the right table. Based on the column names, I will guess that there is only one offer_name for each offer_id. If this is correct, then here my first recommendation. (This will convince the Optimizer that there is no issue with the JOIN.) Change from

SELECT ..., o.offer_name, ...
    LEFT JOIN  `tblOffers` AS o  ON j.`offer_id` = o.`offer_id`
    ...

to

SELECT ...,
        ( SELECT offer_name FROM tbloffers WHERE offer_id j.offer_id
        ) AS offer_name, ...

It also gets rid of a bug wherein you are assuming that the inner ORDER BY will be preserved for the LIMIT. This used to be the case, but in newer versions of MariaDB / MySQL, it is not. The ORDER BY in a "derived table" (your subquery) is now ignored.

2 down, a few more to go.

"Don't hide an indexed column in a function." I am referring to DATE(t.sales_time) = CURDATE(). Assuming you have no sales_time values for the 'future', then that test can be changed to t.sales_time >= CURDATE(). If you really need to restrict to just today, then do this:

  AND sales_time >= CURDATE()
  AND sales_time  < CURDATE() + INTERVAL 1 DAY

The ORDER BY and the LIMIT should usually be put together. In your case, you may as well add the LIMIT to the "derived table", thereby leading to only 5 rows for the outer query to work with. But... There is still the question of getting them sorted correctly. So change from

 SELECT ...
     FROM ( SELECT ...
               ORDER BY ... )
     LIMIT ...

to

 SELECT ...
     FROM ( SELECT ...
               ORDER BY ...
               LIMIT 5 )    -- trim sooner
     ORDER BY ...           -- deal with the loss of ordering from derived table

Rolling it all together, I have

SELECT  j.`offer_id`, 
        ( SELECT  offer_name
            FROM  tbloffers
            WHERE  offer_id = j.offer_id 
        ) AS offer_name,
        j.`success_rate`
    FROM  
        ( SELECT  t.`offer_id`,
                  AVG(t.sales_status = 'SUCCESS') AS `success_rate`
            FROM  `tblSales` AS t
            WHERE  t.sales_time >= CURDATE()
            GROUP BY  t.`offer_id`
            ORDER BY  `success_rate` DESC
            LIMIT  5 
        ) AS j
    ORDER BY  `success_rate` DESC;

(I took the liberty of shortening the SUM(...) in two ways.)

Now for the indexes...

tblSales needs at least (sales_time), but let's go for a "covering" (with sales_time specifically first):

INDEX(sales_time, sales_status, order_id)

If tbloffers has PRIMARY KEY(offer_id), then no further index is worth adding. Else, add this covering index (in this order):

INDEX(offer_id, offer_name)

(Apologies to other Answerers; I stole some of your ideas.)

Upvotes: 1

Goose
Goose

Reputation: 546

Without knowing your schema, the lowest hanging fruit I see is this part....

WHERE   DATE(t.`sales_time`) = CURDATE()

Try changing that to something that looks like

Where t.sales_time >= @12-midnight-of-current-date and t.sales_time <= @23:59:59-of-current-date

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1269693

Here, tblOffers have all the OFFERS listed. And the tblSales contains all the sales. What am trying to find out is the top selling offers, based on the success rate (ie. those sales which are SUCCESS).

Approach this with a simple JOIN and GROUP BY:

SELECT s.offer_id, o.offer_name,
       AVG(s.sales_status = 'SUCCESS') as success_rate
FROM tblSales s JOIN
     tblOffers o
     ON o.offer_id = s.offer_id
WHERE s.sales_time >= CURDATE() AND
      s.sales_time < CURDATE() + INTERVAL 1 DAY
GROUP BY s.offer_id, o.offer_name              
ORDER BY success_rate DESC;

Notes:

  • The use of date arithmetic allows the query to make use of an index on tblSales(sales_time) -- or better yet tblSales(salesTime, offer_id, sales_status).
  • The arithmetic for success_rate has been simplified -- although this has minimal impact on performance.
  • I added offer_name to the GROUP BY. If you are learning SQL, you should always have all the unaggregated keys in the GROUP BY clause.
  • A LEFT JOIN is only needed if you have offers in tblSales which are not in tblOffers. I am guessing you have proper foreign key relationships defined, and this is not the case.

Upvotes: 0

user2560539
user2560539

Reputation:

Based on not much information that you have provided (i mean table schema) you could try the following.

SELECT `o`.`offer_id`, `o`.`offer_name`, SUM(CASE WHEN `t`.`sales_status` = 'SUCCESS' THEN 1 ELSE 0 END) AS `success_rate`
FROM `tblOffers` `o`
INNER JOIN `tblSales` `t`
ON `o`.`offer_id` = `t`.`offer_id`
WHERE DATE(`t`.`sales_time`) = CURDATE()  
GROUP BY `o`.`offer_id`               
ORDER BY `success_rate` DESC
LIMIT 0,5;

You can find a sample of this query in this SQL Fiddle example

Upvotes: 0

Related Questions