Comparing all individual records from two VERY similar tables with lots of data

Question

I am developing a tool for enhancing part of the regression testing process.

What I need to achieve:

I have two tables both with about 7 thousand records. One is a snapshot from production, the other one represents the exactly same records that will be implemented after the update.

Each record contains 50-200 fields of varying length. Two (field_1 and field_2) of these fields serve as identifiers to distinguish the records. Field 1 is used ti pair a record from production to a record from update, field 2 is used to identify the form of the message.

For record from production there is always one and only one record with equal field_1 and field_2 values.

I would like to write either a query, or code that will ( in few seconds at most ) return an array in the following or similar form:

Array
(
    [0] => Array
        (
            [Production] => Array
                (
                    [id] => 83
                    [field1] => value1
                    [field2] => value2
                    [field3] => differing_val
                    [field4] => value3
                      .....
                )

            [Update] => Array
                (
                    [id] => 1
                    [field1] => value1
                    [field2] => value2
                    [field3] => some_other_different_val
                    [field4] => value3
                      .....
                )

        )
)

The problem:

I tried to take all records and build this dynamically ( execution didn't even finish after 10 minutes ), with some limit and pagination, it was better
So I wrote this sql statement, that would do part of the job for me, but even with limit for some values it takes 5-7 minutes to finish (and it only does part of the job for me)

Here is my sql query:

SELECT production_records.* FROM production_records
WHERE production_records.token_2 = 
(
    SELECT update_records.token_2 FROM update_records
    WHERE 
        update_records.token_1 = production_records.token_1 AND 
        update_records.token_2 = production_records.token_2 AND 
        update_records.token_130 <> production_records.token_130  
)
ORDER BY production_records.token_1 DESC
LIMIT 6

Here I know there are 6 differences, therefore the limit and still it takes 406 seconds.

Question:

Do you see anything, that could be done with the data, or sql to make it run faster? I'm not that good with sql yet, so I guess there lies the problem.

I have full control over the data, so I can change anything, didn't come up yet with a better than O(n^2) solution.

vogomatix · Accepted Answer

What you are probably missing are indexes on anything you a comparison on. e.g. token_1, token_2, token_130. See http://dev.mysql.com/doc/refman/5.0/en/create-index.html

CREATE INDEX p_1 ON production_records (token_1);
CREATE INDEX p_2 ON production_records (token_2);
CREATE INDEX p_130 ON production_records (token_130);

CREATE INDEX u_1 ON update_records (token_1);
CREATE INDEX u_2 ON update_records (token_2);
CREATE INDEX u_130 ON update_records (token_130);

I also recommend restructuring your query and phrasing it in terms of joining tables instead of subqueries, as suggested in one of the comments. Something like:

SELECT p.* FROM production_records p
LEFT JOIN update_records u
ON u.token_1 = p.token_1 AND u.token_2 = p.token_2 AND u.token_130 <> p.token_130
ORDER BY p.token_1 DESC

The same query can be phrased as

SELECT p.* FROM production_records p, update_records u
WHERE u.token_1 = p.token_1 AND u.token_2 = p.token_2 AND u.token_130 <> p.token_130
ORDER BY p.token_1 DESC

Both queries are more or less equivalent.

Comparing all individual records from two VERY similar tables with lots of data

Answers (1)

Related Questions