Mike Marks
Mike Marks

Reputation: 10139

Best way to compare two sets of data w/ SQL

What I have is a query that grabs a set of data. This query is ran at a certain time. Then, 30 minutes later, I have another query (same syntax) that runs and grabs that same set of data. Finally, I have a third query (which is the query in question) that compares both sets of data. The records it pulls out are ones that agree with: if "FEDVIP_Active" was FALSE in the first data set and TRUE in the second data set, OR "UniqueID" didn't exist in the first data set and does in the second data set AND FEDVIP_Active is TRUE. I'm questioning the performance of the query below that does the comparison. It times out after 30 minutes. Is there anything you can see that I shouldn't be doing in order to be the most efficient to run? The two identical-ish data sets I'm comparing have around a million records each.

First query that grabs the initial set of data:

select Unique_ID, First_Name, FEDVIP_Active, Email_Primary
from Master_Subscribers_Prospects

Second query is exactly the same as the first.

Then, the third query below compares the data:

select 
    a.FEDVIP_Active, 
    a.Unique_ID, 
    a.First_Name, 
    a.Email_Primary
from 
    Master_Subscribers_Prospects_1 a
inner join
    Master_Subscribers_Prospects_2 b
    on 1 = 1
where a.FEDVIP_Active = 1 and b.FEDVIP_Active = 0 or
(b.Unique_ID not in (select Unique_ID from Master_Subscribers_Prospects_1) and b.FEDVIP_Active = 1)

Upvotes: 0

Views: 4925

Answers (3)

sticky bit
sticky bit

Reputation: 37472

An inner join on 1 = 1 is a disguised cross join and the number of rows a cross join produces can grow rapidly. It's the product of the number of rows in both relations involved. For performance you want to keep intermediate results as small as possible.

Then instead of IN EXISTS is often performing better, when the number of rows of the subquery is large.

But I think you don't need IN or EXITS at all.

Assuming unique_id identifies a record and is not null, you could left join the first table to the second one on common unique_ids. Then if and only if no record for an unique_id in the second table exits the unique_id of the first table in the result of the join is null, so you can check for that.

SELECT b.fedvip_active, 
       b.unique_id, 
       b.first_name,
       b.email_primary
       FROM master_subscribers_prospects_2 b
            LEFT JOIN master_subscribers_prospects_1 a
                      ON b.unique_id = a.unique_id
       WHERE a.fedvip_active = 1 
             AND b.fedvip_active = 0
              OR a.unique_id IS NULL
                 AND b.fedvip_active = 1;

For that query indexes on master_subscribers_prospects_1 (unique_id, fedvip_active) and master_subscribers_prospects_2 (unique_id, fedvip_active) might also help to speed things up.

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1269953

If I understand correctly, you want all records from the second data set where the corresponding unique id in the first data set is not active (either by not existing or by having the flag set to not active).

I would suggest exists:

select a.*
from Master_Subscribers_Prospects_1 a
where a.FEDVIP_Active = 1 and
      not exists (select 1
                  from Master_Subscribers_Prospects_2 b
                  where b.Unique_ID = a.Unique_ID and
                        b.FEDVIP_Active = 1
                 );

For performance, you want an index on Master_Subscribers_Prospects_2(Unique_ID, FEDVIP_Active).

Upvotes: 1

Alen.Toma
Alen.Toma

Reputation: 4870

Doing an inner select in where sats is always bad.

Here is a same version with a left join, that might work for you.

select 
    a.FEDVIP_Active, 
    a.Unique_ID, 
    a.First_Name, 
    a.Email_Primary
from 
    Master_Subscribers_Prospects_1 a
inner join
    Master_Subscribers_Prospects_2 b on 1 = 1
left join Master_Subscribers_Prospects_1 sa on sa.Unique_ID = b.Unique_ID 
where (a.FEDVIP_Active = 1 and b.FEDVIP_Active = 0) or
(sa.Unique_ID is null and b.FEDVIP_Active = 1)

Upvotes: 0

Related Questions