Reputation: 4117
I am running a SQL query on a table containing 3 million records comparing email addresses.
We have two email address fields, primary and secondary.
I am comparing a subset of primary emails against all other primary and secondary Emails to get a count of both duplicates and unique Emails in the data.
I believe this code works, its still running 10 mins in, and I have to do this for another 9 subsets which are alot larger than this one. Code is as follows:
SELECT COUNT(*) AS UniqueRecords
FROM AllVRContacts
WHERE LEN(EMAIL) > 1 AND ACCOUNTID = '00120000003bNmMAAU'
AND EMAIL NOT IN
(SELECT EMAIL FROM AllVRContacts WHERE ACCOUNTID != '00120000003bNmMAAU')
AND EMAIL NOT IN
(SELECT SECONDARY_EMAIL_ADDRESS__C FROM AllVRContacts WHERE ACCOUNTID != '00120000003bNmMAAU')
I want to learn something from this rather than just have someone scratch my back for me, the more explanation the better!
Thanks guys,
Upvotes: 1
Views: 183
Reputation: 77667
Can this be applicable?
SELECT ACCOUNTID, COUNT(*) AS UniqueRecords
FROM (
SELECT ACCOUNTID, EMAIL
FROM AllVRContacts
WHERE ACCOUNTID = '00120000003bNmMAAU' AND LEN(EMAIL) > 1
UNION
SELECT ACCOUNTID, SECONDARY_EMAIL_ADDRESS__C
FROM AllVRContacts
WHERE ACCOUNTID = '00120000003bNmMAAU' AND LEN(SECONDARY_EMAIL_ADDRESS__C) > 1
) s
I understood that basically you wanted to count distinct email addresses for each ACCOUNTID.
UNION in the inner query eliminates duplicates so the output (of the inner query) only has distinct pairs of account ids and emails, whether primary or secondary. Particularly this means that if an email address is stored as both primary and secondary, it will count only once. Same applies to same primary or same secondary address stored in different rows.
Now you only need to count the rows, which is done by the outer query.
If another 9 subsets you've mentioned mean simply other ACCOUNTIDs, then maybe you could try GROUP BY ACCOUNTID
applied to the outer query and the ACCOUNTID = '...'
part of both WHERE
clauses got rid of to count emails for all of them with one query. That is, like this:
SELECT ACCOUNTID, COUNT(*) AS UniqueRecords
FROM (
SELECT ACCOUNTID, EMAIL
FROM AllVRContacts
WHERE LEN(EMAIL) > 1
UNION
SELECT ACCOUNTID, SECONDARY_EMAIL_ADDRESS__C
FROM AllVRContacts
WHERE LEN(SECONDARY_EMAIL_ADDRESS__C) > 1
) s
GROUP BY ACCOUNTID
Upvotes: 1
Reputation: 1547
Try this and let me know
SELECT ACCOUNTID,COUNT(*) AS UniqueRecords
FROM AllVRContacts
WHERE LEN(EMAIL) > 1 AND ACCOUNTID = '00120000003bNmMAAU'
Group by ACCOUNTID
Having COUNT(EMAIL) >1
Upvotes: 0
Reputation: 3594
SELECT COUNT(*)
FROM (SELECT EMAIL AS UniqueRecords
FROM AllVRContacts a
WHERE ACCOUNTID = '00120000003bNmMAAU'
AND NOT EXISTS (SELECT EMAIL FROM AllVRContacts b
WHERE ACCOUNTID != '00120000003bNmMAAU'
AND (
a.EMAIL = b.EMAIL
OR a.EMAIL = b.SECONDARY_EMAIL_ADDRESS__C
)
)
AND LEN(EMAIL) > 1
GROUP BY EMAIL
) c
So how is this query better?
You typically want to use NOT EXISTS instead of NOT IN
IN
returns true if a specified value matches any value in a subquery or a list
EXISTS
returns true if a subquery contains any rows
More Info: SQL Server: JOIN vs IN vs EXISTS - the logical difference
= performs much better than !=
Reduce the scans (seeks if you have indexes on AllVRContacts) by not searching through AllVRContacts a second time for the secondary e-mail comparison
GROUP BY
resolves potential duplicate e-mails within the ACCOUNTID
To further improve performance, add indexes as Quassnoi suggested and whatever is populating the table should validate e-mails to remove the need for the LEN check.
[EDIT] Added explanation to (3)
Upvotes: 1
Reputation: 425341
Create the following indexes:
AllVrContacts (AccountID) INCLUDE (Email)
AllVrContacts (Email) INCLUDE (AccountID)
AllVrContacts (SECONDARY_EMAIL_ADDRESS__C) INCLUDE (AccountID)
The index on (AccountID, Email)
will be used for the WHERE
filter in the main query:
WHERE ACCOUNTID = '00120000003bNmMAAU'
AND LEN(Email) > 1
The other two indexes will be used for antijoins (NOT IN
) against this table.
You should also use:
SELECT COUNT(DISTINCT email) AS UniqueRecords
if you want the duplicates across the same account to be counted only once.
Upvotes: 2