Reputation: 3445
I am developing a tool for enhancing part of the regression testing process.
What I need to achieve:
I have two tables both with about 7 thousand records. One is a snapshot from production, the other one represents the exactly same records that will be implemented after the update.
Each record contains 50-200 fields of varying length. Two (field_1 and field_2) of these fields serve as identifiers to distinguish the records. Field 1 is used ti pair a record from production to a record from update, field 2 is used to identify the form of the message.
For record from production there is always one and only one record with equal field_1 and field_2 values.
I would like to write either a query, or code that will ( in few seconds at most ) return an array in the following or similar form:
Array
(
[0] => Array
(
[Production] => Array
(
[id] => 83
[field1] => value1
[field2] => value2
[field3] => differing_val
[field4] => value3
.....
)
[Update] => Array
(
[id] => 1
[field1] => value1
[field2] => value2
[field3] => some_other_different_val
[field4] => value3
.....
)
)
)
The problem:
Here is my sql query:
SELECT production_records.* FROM production_records
WHERE production_records.token_2 =
(
SELECT update_records.token_2 FROM update_records
WHERE
update_records.token_1 = production_records.token_1 AND
update_records.token_2 = production_records.token_2 AND
update_records.token_130 <> production_records.token_130
)
ORDER BY production_records.token_1 DESC
LIMIT 6
Here I know there are 6 differences, therefore the limit and still it takes 406 seconds.
Question:
Do you see anything, that could be done with the data, or sql to make it run faster? I'm not that good with sql yet, so I guess there lies the problem.
I have full control over the data, so I can change anything, didn't come up yet with a better than O(n^2) solution.
Upvotes: 1
Views: 122
Reputation: 5041
What you are probably missing are indexes on anything you a comparison on. e.g. token_1, token_2, token_130. See http://dev.mysql.com/doc/refman/5.0/en/create-index.html
CREATE INDEX p_1 ON production_records (token_1);
CREATE INDEX p_2 ON production_records (token_2);
CREATE INDEX p_130 ON production_records (token_130);
CREATE INDEX u_1 ON update_records (token_1);
CREATE INDEX u_2 ON update_records (token_2);
CREATE INDEX u_130 ON update_records (token_130);
I also recommend restructuring your query and phrasing it in terms of joining tables instead of subqueries, as suggested in one of the comments. Something like:
SELECT p.* FROM production_records p
LEFT JOIN update_records u
ON u.token_1 = p.token_1 AND u.token_2 = p.token_2 AND u.token_130 <> p.token_130
ORDER BY p.token_1 DESC
The same query can be phrased as
SELECT p.* FROM production_records p, update_records u
WHERE u.token_1 = p.token_1 AND u.token_2 = p.token_2 AND u.token_130 <> p.token_130
ORDER BY p.token_1 DESC
Both queries are more or less equivalent.
Upvotes: 1