Reputation: 812
I have a MySQL 5.1 InnoDB table (customers) with the following structure:
int record_id (PRIMARY KEY)
int user_id (ALLOW NULL)
varchar(11) postcode (ALLOW NULL)
varchar(30) region (ALLOW NULL)
..
..
..
There are roughly 7 million rows in the table. Currently, the table is being queried like this:
SELECT * FROM customers WHERE user_id IN (32343, 45676, 12345, 98765, 66010, ...
In the actual query, currently over 560 user_ids are in the IN clause. With several million records in the table, this query is slow!
There are secondary indexes on the table, the first of which is on user_id itself, which I thought would help.
I know that SELECT * is A Bad Thing and this will be expanded to the full list of fields required. However, the fields not listed above are more ints and doubles. There are another 50 of those being returned, but they are needed for the report.
I imagine there's a much better way to access the data for the user_ids, but I can't think how to do it. My initial reaction is to remove the ALLOW NULL on the user_id field, as I understand NULL handling slows down queries?

I'd be very grateful if you could point me in a more efficient direction than using the IN ( ) method.
EDIT: Ran EXPLAIN, which said:
select_type = SIMPLE
table = customers
type = range
possible_keys = userid_idx
key = userid_idx
key_len = 5
ref = (NULL)
rows = 637640
Extra = Using where
does that help?
Upvotes: 1
Views: 229
Reputation: 44308
Are they the same ~560 ids every time? Or is it a different ~560 ids on different runs of the query?

You could just insert your 560 user_ids into a separate table (or even a temp table), stick an index on that table and inner join it to your original table.
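A minimal sketch of that approach, assuming a hypothetical helper table wanted_users and using the sample IDs from the question (the real query would insert all ~560):

-- Hypothetical helper table holding this run's ids
CREATE TEMPORARY TABLE wanted_users (
    user_id INT NOT NULL,
    PRIMARY KEY (user_id)   -- the index that supports the join
);

INSERT INTO wanted_users (user_id)
VALUES (32343), (45676), (12345), (98765), (66010);

SELECT c.*
FROM wanted_users w
INNER JOIN customers c ON c.user_id = w.user_id;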
Upvotes: 1
Reputation: 63596
"Select *" is not as bad as some people think; row-based databases will fetch the entire row if they fetch any of it, so in situations where you're not using a covering index, "SELECT *" is essentially no slower than "SELECT a,b,c" (NB: There is sometimes an exception when you have large BLOBs, but that is an edge-case).
First things first - does your database fit in RAM? If not, get more RAM. No, seriously. Now, suppose your database is too huge to reasonably fit into RAM (say, > 32 GB), then you should try to reduce the number of random I/Os, as they are probably what's holding things up.
I'll assume from here on that you're running proper server-grade hardware with a RAID controller in RAID1 (or RAID10, etc.) and at least two spindles. If you're not, go away and get that.
You could definitely consider using a clustered index. In MySQL InnoDB you can only cluster the primary key, which means that if something else is currently the primary key, you'll have to change it. Composite primary keys are ok, and if you're doing a lot of queries on one criterion (say user_id) it is a definite benefit to make it the first part of the primary key (you'll need to add something else to make it unique).
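A hedged sketch of what that change could look like for this table (an assumption, not a drop-in fix: user_id would have to become NOT NULL to be part of the primary key, and the rebuild is expensive on a 7-million-row table):

-- Sketch: re-cluster customers on user_id by making it the leading
-- column of a composite primary key; record_id keeps the key unique.
ALTER TABLE customers
    MODIFY user_id INT NOT NULL,
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (user_id, record_id);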
Alternatively, you might be able to make your query use a covering index, in which case you don't need user_id to be the primary key (in fact, it must not be). This will only happen if all of the columns you need are in an index which begins with user_id.
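For illustration only (the report here needs ~50 columns, so covering them all is probably impractical), a covering index for a two-column version of the query might look like:

-- Sketch: an index beginning with user_id that also contains every
-- selected column, so InnoDB never has to visit the base rows.
CREATE INDEX userid_covering_idx ON customers (user_id, postcode, region);

SELECT user_id, postcode, region
FROM customers
WHERE user_id IN (32343, 45676, 12345, 98765, 66010);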
As far as query efficiency is concerned, WHERE user_id IN (big list of IDs) is almost certainly the most efficient way of doing it from SQL.
BUT my biggest tips are the ones above: fit the database in RAM and cut down on random I/Os.
Upvotes: 2
Reputation: 425723
First, check if there is an index on user_id and make sure it's used. You can do that by running EXPLAIN.
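For example, using the sample IDs from the question:

-- Look at the key column of the output to see which index is chosen
EXPLAIN SELECT *
FROM customers
WHERE user_id IN (32343, 45676, 12345, 98765, 66010);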
Second, create a temporary table and use it in a JOIN:
CREATE TEMPORARY TABLE temptable (user_id INT NOT NULL);

-- populate it with the ids (samples from the question)
INSERT INTO temptable (user_id)
VALUES (32343), (45676), (12345), (98765), (66010);

SELECT *
FROM temptable t
JOIN customers c
ON c.user_id = t.user_id;
Third, how many rows does your query return? If it returns almost all rows, then it will just be slow, since it will have to pump all these millions over the connection channel, to begin with.
NULL will not slow your query down, since the IN condition is only satisfied by non-NULL values, which are indexed.
Update:
The index is used, the plan is fine except that it returns more than half a million rows.
Do you really need to put all these 638,000 rows into the report?

Hope it's not printed: bad for rainforests, global warming and stuff.
Speaking seriously, you seem to need either aggregation or pagination on your query.
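Either route might look roughly like this (my sketch, with the question's sample IDs and an illustrative grouping column):

-- Pagination: fetch the report a page at a time
SELECT *
FROM customers
WHERE user_id IN (32343, 45676, 12345, 98765, 66010)
ORDER BY record_id
LIMIT 1000 OFFSET 0;   -- bump OFFSET by 1000 for each following page

-- Aggregation: if the report is really a summary, compute it server-side
SELECT region, COUNT(*) AS customer_count
FROM customers
WHERE user_id IN (32343, 45676, 12345, 98765, 66010)
GROUP BY region;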
Upvotes: 3
Reputation: 39057
Is this your most important query? Is this a transactional table?
If so, try creating a clustered index on user_id. Your query might be slow because it still must make random disk reads to retrieve the columns (key lookups), even after finding the records that match (index seek on the user_id index).
If you cannot change the clustered index, then you might want to consider an ETL process (simplest is a trigger that inserts into another table with the best indexing). This should yield faster results.
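A rough sketch of the trigger flavour of that ETL, under stated assumptions: a hypothetical reporting table customers_report clustered on user_id, only a couple of columns copied for brevity, and NULL user_ids skipped since the source column allows them:

-- Hypothetical reporting copy, clustered on user_id via the primary key
CREATE TABLE customers_report (
    user_id   INT NOT NULL,
    record_id INT NOT NULL,
    postcode  VARCHAR(11),
    region    VARCHAR(30),
    PRIMARY KEY (user_id, record_id)
) ENGINE=InnoDB;

DELIMITER //
CREATE TRIGGER customers_ai AFTER INSERT ON customers
FOR EACH ROW
BEGIN
    -- user_id allows NULL in the source table, so guard against it
    IF NEW.user_id IS NOT NULL THEN
        INSERT INTO customers_report (user_id, record_id, postcode, region)
        VALUES (NEW.user_id, NEW.record_id, NEW.postcode, NEW.region);
    END IF;
END//
DELIMITER ;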
Also note that such large queries may take some time to parse, so help it out by putting the queried ids into a temp table if possible.
Upvotes: 1
Reputation: 8930
You can try inserting the ids you need to query on into a temp table and inner joining both tables. I don't know if that would help.
Upvotes: 0