Gihan

Reputation: 4283

Optimizing MySQL query for better performance

I have the following query:

SELECT o.order_id,
       p.pre_sale_phone_manual_id AS id,
       p.created,
       p.user_id
FROM `order` o
LEFT JOIN `customer` c ON c.customer_id = o.customer_id,
                          `pre_sale_phone_manual` p
LEFT JOIN `pre_sale_phone_manual` p1 ON p.pre_sale_phone_manual_id=p1.pre_sale_phone_manual_id
AND p.created > p1.created
WHERE p1.user_id IS NULL
  AND p.phone <> ""
  AND REPLACE(REPLACE(REPLACE(REPLACE(c.phone, "-", ""), ".", ""), "+", ""), " ", "") LIKE CONCAT('%', RIGHT(REPLACE(REPLACE(REPLACE(REPLACE(p.phone, "-", ""), ".", ""), "+", ""), " ", ""), 10))
  AND o.created > p.created
  AND o.created < (DATE_ADD(p.created, INTERVAL 183 DAY))
  AND o.created > '2013-12-30 08:28:37'

What the query basically does is match the phone numbers of customers against entries in the pre_sale_phone_manual table. The pre_sale_phone_manual record should be dated before the order's date, should be within 6 months (183 days) of it, and should be the first matching entry in the pre_sale_phone_manual table, because there can be duplicate entries by other users.

As far as I've found, the slowness is in the join between the order table and the pre_sale_phone_manual table, because there is no 1-to-1 join so it scans the whole tables, and obviously because of the INTERVAL 183 DAY condition.

Following is the EXPLAIN output for the query:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: o
         type: ALL
possible_keys: order_created_index,fk_order_customer
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 110658
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: p
         type: ALL
possible_keys: created,phone
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 2053
        Extra: Using where; Using join buffer
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: p1
         type: eq_ref
possible_keys: PRIMARY,created
          key: PRIMARY
      key_len: 4
          ref: 463832_yii_adm_t4f.p.pre_sale_phone_manual_id
         rows: 1
        Extra: Using where; Not exists
*************************** 4. row ***************************
           id: 1
  select_type: SIMPLE
        table: c
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: 463832_yii_adm_t4f.o.customer_id
         rows: 1
        Extra: Using where

The following stats are from the MySQL slow query log:

Query_time: 126.038395  Lock_time: 0.000303 Rows_sent: 72  Rows_examined: 15266616

The following fields are already indexed:

order.created
pre_sale_phone_manual.created
pre_sale_phone_manual.phone
and the PKs and FKs with the _id suffix

Please help me optimize the query, and thanks for your time.

Upvotes: 2

Views: 547

Answers (5)

Jens Schauder

Reputation: 81907

Tuning is hard when one doesn't have the exact data to play with. But anyway ...

  1. You have a weird-looking self join on pre_sale_phone_manual, on the same column on both sides (!?). This looks somewhat like a mistake. Anyway, MySQL supports analytic functions, and I think your self join can be transformed into a single table access using those.

  2. Others have already noticed that the LIKE condition on denormalized phone numbers is going to hurt. I'd suggest the following: add a column INVERSE_PHONE on p and c which contains the phone number, normalized as needed in your select and reversed (back to front); maintain it using triggers. Put an index on that column on p and use it in the WHERE clause. This basically replaces a function-based index, which it seems was planned for MySQL but is gone without a trace as far as I can tell.

  3. If this still doesn't do the trick, do the same for DATE_ADD(p.created, INTERVAL 183 DAY), and put all the columns of p that get used in the select into a single index, beginning with the most selective column.

  4. All the conditions that have one table on one side and a different table on the other are part of the join, so put them in the join condition and not in the WHERE clause. This hopefully has no effect on performance, but it makes the statement easier to read.
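The INVERSE_PHONE idea from point 2 could be sketched roughly as follows. The column, trigger, and index names below are hypothetical, and a matching BEFORE UPDATE trigger plus a one-off backfill of existing rows would also be needed:

```sql
-- Sketch only, not a tested migration; names are assumptions.
ALTER TABLE `pre_sale_phone_manual` ADD COLUMN inverse_phone VARCHAR(32);

-- Strip the separator characters the original query strips, then reverse.
CREATE TRIGGER pspm_before_insert
BEFORE INSERT ON `pre_sale_phone_manual`
FOR EACH ROW
SET NEW.inverse_phone = REVERSE(
    REPLACE(REPLACE(REPLACE(REPLACE(NEW.phone, '-', ''), '.', ''), '+', ''), ' ', ''));

CREATE INDEX idx_pspm_inverse_phone ON `pre_sale_phone_manual` (inverse_phone);
```

With the same column maintained on customer, the leading-wildcard LIKE turns into a prefix match that can use the index, e.g. `c.inverse_phone LIKE CONCAT(LEFT(p.inverse_phone, 10), '%')`.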

Upvotes: 1

Kickstart

Reputation: 21513

The first thing is that you have mixed implicit and explicit joins. Just for readability, use an explicit INNER JOIN for pre_sale_phone_manual, with an ON clause.

Further, you refer to columns from customer in the WHERE clause, which renders the LEFT JOIN on customer irrelevant. Change this to an INNER JOIN as well.

However, this is still not going to be quick. Your join between pre_sale_phone_manual and order uses DATE_ADD, which forces a calculation on a field and likely prevents any useful use of an index on that join.

The same applies to the check of the phone field on the customer and pre_sale_phone_manual tables (especially as you use a leading wildcard in the LIKE).

How many records are there in pre_sale_phone_manual for each resulting row? If a large number, it might be worth using a subquery to exclude all but the latest one.

SELECT o.order_id,
       p.pre_sale_phone_manual_id AS id,
       p.created,
       p.user_id
FROM `order` o
INNER JOIN 
(   
    SELECT pre_sale_phone_manual_id, MAX(created) AS max_created
    FROM `pre_sale_phone_manual`
    GROUP BY pre_sale_phone_manual_id
) p_sub
ON o.created > p_sub.max_created AND o.created < (DATE_ADD(p_sub.max_created, INTERVAL 183 DAY))
INNER JOIN pre_sale_phone_manual p
ON p.pre_sale_phone_manual_id =  p_sub.pre_sale_phone_manual_id
AND p.created =  p_sub.max_created 
INNER JOIN `customer` c ON c.customer_id = o.customer_id
WHERE p.phone <> ""
  AND REPLACE(REPLACE(REPLACE(REPLACE(c.phone, "-", ""), ".", ""), "+", ""), " ", "") LIKE CONCAT('%', RIGHT(REPLACE(REPLACE(REPLACE(REPLACE(p.phone, "-", ""), ".", ""), "+", ""), " ", ""), 10))
  AND o.created > '2013-12-30 08:28:37'

Upvotes: 1

Paul Maxwell

Reputation: 35583

There are a few performance "killers":

  1. The Cartesian product of num-rows-of(customer) * num-rows-of(pre_sale_phone_manual)
  2. then the inefficient method of matching c.phone to p.phone
  3. Trying to locate the first record per phone in pre_sale_phone_manual using left join

(Are you trying to find the first record in pre_sale_phone_manual for each phone? I think that's what the code is doing, so I have assumed this is the case.)

I can't easily solve item 2; it seems your phone columns can't be trusted 100%. But if that problem were solved, the query (I think) might be:

SELECT
      o.order_id
    , p.pre_sale_phone_manual_id AS id
    , p.created
    , p.user_id
FROM `order` o
      INNER JOIN `customer` c
            ON c.customer_id = o.customer_id
      INNER JOIN (
            SELECT
                  pspm.pre_sale_phone_manual_id AS id
                , pspm.created
                , pspm.user_id
                , pspm.phone
            FROM `pre_sale_phone_manual` pspm
                  INNER JOIN (
                        SELECT
                              phone
                            , MIN(created) AS created
                        FROM `pre_sale_phone_manual`
                        GROUP BY
                              phone
                  ) dc
                        ON pspm.created = dc.created 
                        AND pspm.phone = dc.phone
      ) p
            ON c.phone = p.phone /* see notes on this join */
WHERE o.created > p.created
      AND o.created < DATE_ADD(p.created, INTERVAL 183 DAY)
      AND o.created > '2013-12-30 08:28:37'

notes on the phone = phone join (untrustworthy phone columns)

There is not a lot a query developer can do unless they also have control over the tables. One method would be to add columns that ARE reliable and index those new columns. MySQL does not have function-based indexes or computed columns that I'm aware of, so how you arrive at reliable data is not simple.

This previous question holds a function that may be useful, for example if you added good_phone to customer:

/*
Function from user1467716
https://stackoverflow.com/questions/287105/mysql-strip-non-numeric-characters-to-compare
*/

DELIMITER //

CREATE FUNCTION STRIP_NON_DIGIT(input VARCHAR(255))
   RETURNS VARCHAR(255)
BEGIN
   DECLARE output   VARCHAR(255) DEFAULT '';
   DECLARE iterator INT          DEFAULT 1;
   -- Walk the string and keep only digit characters.
   WHILE iterator < (LENGTH(input) + 1) DO
      IF SUBSTRING(input, iterator, 1) IN ( '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' ) THEN
         SET output = CONCAT(output, SUBSTRING(input, iterator, 1));
      END IF;
      SET iterator = iterator + 1;
   END WHILE;
   RETURN output;
END
//

DELIMITER ;

UPDATE customer
SET good_phone = STRIP_NON_DIGIT(phone);

If you aren't able to solve the unreliable phone data, then you suffer the performance that implies, and instead of "phone = phone" you will need to continue with:

AND REPLACE(REPLACE(REPLACE(REPLACE(c.phone, "-", ""), ".", ""), "+", ""), " ", "") etc.

Upvotes: 3

Andreas Kalin

Reputation: 326

So, just to repeat what others and myself have already written:

  • You are actually doing an expensive CROSS JOIN with pre_sale_phone_manual: all rows on the left side combined with all rows on the right side. That's a lot of rows.
  • Despite the LEFT JOIN on customer, you are in fact already doing an INNER JOIN, due to a WHERE condition (the LIKE condition).
  • You would benefit from normalizing phone numbers.
  • LIKE conditions do not fully benefit from indexes if the condition starts with a wildcard '%'. (They can benefit to some extent if the index is small enough to fit in memory, since the index scan will be quicker, but it will still be O(n) rather than O(log n).)

I have made a trivial, obviously untested, rewrite under the assumption that the OUTER JOINs and the CROSS JOIN are not required, i.e. that you always have a matching record in pre_sale_phone_manual. You could try it out if the assumption is valid.

SELECT o.order_id,
       p.pre_sale_phone_manual_id AS id,
       p.created,
       p.user_id
FROM `order` o
JOIN `customer` c ON c.customer_id = o.customer_id
JOIN `pre_sale_phone_manual` p
LEFT JOIN `pre_sale_phone_manual` p1 
    ON p.pre_sale_phone_manual_id=p1.pre_sale_phone_manual_id
    AND p.created > p1.created
WHERE p1.user_id IS NULL
  AND p.phone <> ""
  AND REPLACE(REPLACE(REPLACE(REPLACE(c.phone, "-", ""), ".", ""), "+", ""), " ", "") 
      LIKE CONCAT('%', RIGHT(REPLACE(REPLACE(REPLACE(REPLACE(p.phone, "-", ""), ".", ""), "+", ""), " ", ""), 10))
  AND o.created > p.created
  AND o.created < (DATE_ADD(p.created, INTERVAL 183 DAY))
  AND o.created > '2013-12-30 08:28:37'

So, traditionally we prefer JOINs in MySQL due to performance issues with subqueries in older versions. However, you could also try and see what happens if you use NOT EXISTS (...) instead of the LEFT JOIN ... p1.
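The NOT EXISTS variant might look roughly like this. This is only a sketch: it keeps the original self-join columns (including the odd join of the primary key to itself) and assumes, as above, that the outer joins are not needed:

```sql
SELECT o.order_id,
       p.pre_sale_phone_manual_id AS id,
       p.created,
       p.user_id
FROM `order` o
JOIN `customer` c ON c.customer_id = o.customer_id
JOIN `pre_sale_phone_manual` p
WHERE NOT EXISTS
      (SELECT 1
       FROM `pre_sale_phone_manual` p1
       WHERE p1.pre_sale_phone_manual_id = p.pre_sale_phone_manual_id
         AND p1.created < p.created)
  AND p.phone <> ""
  AND REPLACE(REPLACE(REPLACE(REPLACE(c.phone, "-", ""), ".", ""), "+", ""), " ", "")
      LIKE CONCAT('%', RIGHT(REPLACE(REPLACE(REPLACE(REPLACE(p.phone, "-", ""), ".", ""), "+", ""), " ", ""), 10))
  AND o.created > p.created
  AND o.created < DATE_ADD(p.created, INTERVAL 183 DAY)
  AND o.created > '2013-12-30 08:28:37'
```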

Upvotes: 1

paubo147

Reputation: 778

I am more familiar with Oracle, but what about indexes? They can speed up queries a lot and avoid full table scans, especially with left outer joins. From the EXPLAIN output I see that no such indexes are being used.

Try to place smart indexes. Again, I have worked with Oracle, but I think MySQL should also place indexes on primary and foreign keys.
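As a sketch, composite indexes covering the join and range filter together might help where the existing single-column indexes do not. The column choices below are guesses from the query's WHERE and JOIN clauses, not something measured against the actual data:

```sql
-- Hypothetical index names; verify with EXPLAIN before keeping them.
CREATE INDEX idx_order_created_customer ON `order` (created, customer_id);
CREATE INDEX idx_pspm_created_phone ON `pre_sale_phone_manual` (created, phone);
```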

Upvotes: 0
