Bala Sambandam
Bala Sambandam

Reputation: 83

Large table, slow query

This is for a LAMP project. For the purpose of illustration I will use a simplified problem:

create table table1 (
  id int unsigned primary key,
  mail_zip varchar(9),
  index (mail_zip(5))
);

create table table2 (
  name varchar(255),
  zip varchar(5)
);

select table1.id from table1
where substring(mail_zip, 1, 5) in
  (select zip from table2 where name = 'test');

Table1 contains over 5 million records with a 9 digit zip code. Table 2 usually has less than 10 rows for a particular table2.name and uses only 5 digit zip codes. This query takes an unacceptably long time. In my actual code table1 is a copy of a national database containing over 100 columns. I would like to try to maintain equivalence between that table and the national database so I would like to avoid adding columns or shortening the zip to 5 digits; however, my default plan is to just limit mail_zip to the first 5 digits on insert to avoid the use of substring() which I think is the problem unless someone has a better idea.

EDIT: Unfortunately, most of the suggestions below did not result in any noticable improvement except for sticky bit's. My query as originally written takes a little over 3 minutes. So did most of the other suggestions. sticky bit's dropped the time down to 3.5 sec. Truncating the mail_zip field in table1 to 5 digits dropped query time down to 0.06sec. While I would have liked my local table to match the national database exactly it's hard for me to see any practical loss of function in my application by just dropping the last 4 digits of the zip code so that is the way I will go.

Upvotes: 0

Views: 71

Answers (4)

sticky bit
sticky bit

Reputation: 37472

You can try to use an INNER JOIN with a LIKE.

SELECT DISTINCT
       table1.id
       FROM table1
            INNER JOIN table2
                       ON table1.mail_zip LIKE concat(table2.zip, '%')
            WHERE table2.name = 'test';

That shifts the use of a function on to the column of the smaller table.

To support that, also create a compound index on table1 (mail_zip, id) (do not limit the mail_zip).

CREATE INDEX table1_mail_zip_id
             ON table1
                (mail_zip,
                 id);

And maybe another index on table2 (name, zip). Though I guess that one won't make much of a difference if there's only 10 rows in table2.

CREATE INDEX table2_name_zip
             ON table2
                (name,
                 zip);

Like that the index on table1 might get picked up (It did in my tests but I have no data, so that doesn't say much). There is a penalty for the DISTINCT though, but I'd expect the use of the index outweighs that significantly.

Upvotes: 0

Rahmat Ihsan
Rahmat Ihsan

Reputation: 326

try this

select table1.id from table1
INNER JOIN table2
ON table1.mail_zip LIKE CONCAT(table2.zip,'%')
WHERE name = 'test';

Upvotes: 0

The Impaler
The Impaler

Reputation: 48769

The problem is in the "left side expression" on the filter.

substring(mail_zip, 1, 5) in ...

In general an expression on the left side of the equality can/will defeat the usage of the index. The typical solution is to rephrase the query, but in your case you can't. That simple solution is not at hand.

Nevertheless, if you are running MySQL 5.7 or newer, there's a workaround that is quite fast:

  1. Add a virtual column to the table that computes the 5-digit zip value.

  2. Create an index on the virtual column.

  3. Modify your query to use the virtual column instead of the original column.

Here's the example:

alter table table1 add zip5 varchar(5) 
  generated always as (substring(mail_zip, 1, 5)) virtual;

create index ix1_table1 on table1 (zip5);

select table1.id from table1
where zip5 in
  (select zip from table2 where name = 'test');

Upvotes: 2

Error_2646
Error_2646

Reputation: 3781

You could try rewriting as

select table1.id 
  from table1 t1
 where exists
       ( SELECT 1
           FROM table2 t2
          WHERE substring(t1.mail_zip,1,5) = t2.zip
            AND t2.zip
       );

Writing it as an exists or a join might actually get that index used.

Generally speaking if you are having to do functions in conditions

ex. substring(t1.mail_zip,1,5) = t2.zip

its a sign your model has room for improvement.

Upvotes: 0

Related Questions