Reputation: 4117

Can I make this T-SQL code more efficient

I am running a SQL query on a table containing 3 million records comparing email addresses.

We have two email address fields, primary and secondary.

I am comparing a subset of primary emails against all other primary and secondary Emails to get a count of both duplicates and unique Emails in the data.

I believe this code works, its still running 10 mins in, and I have to do this for another 9 subsets which are alot larger than this one. Code is as follows:

SELECT COUNT(*) AS UniqueRecords
FROM AllVRContacts 
WHERE LEN(EMAIL) > 1 AND ACCOUNTID = '00120000003bNmMAAU'

AND EMAIL NOT IN
(SELECT EMAIL FROM AllVRContacts WHERE ACCOUNTID != '00120000003bNmMAAU')

AND EMAIL NOT IN
(SELECT SECONDARY_EMAIL_ADDRESS__C FROM AllVRContacts WHERE ACCOUNTID != '00120000003bNmMAAU')

I want to learn something from this rather than just have someone scratch my back for me, the more explanation the better!

Thanks guys,

Upvotes: 1

Answers (4)

Andriy M

Reputation: 77667

Can this be applicable?

SELECT ACCOUNTID, COUNT(*) AS UniqueRecords
FROM (
  SELECT ACCOUNTID, EMAIL
  FROM AllVRContacts
  WHERE ACCOUNTID = '00120000003bNmMAAU' AND LEN(EMAIL) > 1
  UNION
  SELECT ACCOUNTID, SECONDARY_EMAIL_ADDRESS__C
  FROM AllVRContacts
  WHERE ACCOUNTID = '00120000003bNmMAAU' AND LEN(SECONDARY_EMAIL_ADDRESS__C) > 1
) s

I understood that basically you wanted to count distinct email addresses for each ACCOUNTID.

UNION in the inner query eliminates duplicates so the output (of the inner query) only has distinct pairs of account ids and emails, whether primary or secondary. Particularly this means that if an email address is stored as both primary and secondary, it will count only once. Same applies to same primary or same secondary address stored in different rows.

Now you only need to count the rows, which is done by the outer query.

If another 9 subsets you've mentioned mean simply other ACCOUNTIDs, then maybe you could try GROUP BY ACCOUNTID applied to the outer query and the ACCOUNTID = '...' part of both WHERE clauses got rid of to count emails for all of them with one query. That is, like this:

SELECT ACCOUNTID, COUNT(*) AS UniqueRecords
FROM (
  SELECT ACCOUNTID, EMAIL
  FROM AllVRContacts
  WHERE LEN(EMAIL) > 1
  UNION
  SELECT ACCOUNTID, SECONDARY_EMAIL_ADDRESS__C
  FROM AllVRContacts
  WHERE LEN(SECONDARY_EMAIL_ADDRESS__C) > 1
) s
GROUP BY ACCOUNTID

Upvotes: 1

Thanigainathan

Reputation: 1547

Try this and let me know

SELECT ACCOUNTID,COUNT(*) AS UniqueRecords FROM AllVRContacts
WHERE LEN(EMAIL) > 1 AND ACCOUNTID = '00120000003bNmMAAU'
Group by ACCOUNTID Having COUNT(EMAIL) >1

Upvotes: 0

bitxwise

Reputation: 3594

SELECT COUNT(*)
FROM   (SELECT EMAIL AS UniqueRecords
        FROM   AllVRContacts a
        WHERE  ACCOUNTID = '00120000003bNmMAAU'
               AND NOT EXISTS (SELECT EMAIL FROM AllVRContacts b
                               WHERE ACCOUNTID != '00120000003bNmMAAU'
                                     AND (
                                          a.EMAIL = b.EMAIL
                                          OR a.EMAIL = b.SECONDARY_EMAIL_ADDRESS__C
                                     )
               )
               AND LEN(EMAIL) > 1
        GROUP  BY EMAIL
) c

So how is this query better?

You typically want to use NOT EXISTS instead of NOT IN

IN returns true if a specified value matches any value in a subquery or a list

EXISTS returns true if a subquery contains any rows

More Info: SQL Server: JOIN vs IN vs EXISTS - the logical difference
= performs much better than !=
Reduce the scans (seeks if you have indexes on AllVRContacts) by not searching through AllVRContacts a second time for the secondary e-mail comparison
GROUP BY resolves potential duplicate e-mails within the ACCOUNTID

To further improve performance, add indexes as Quassnoi suggested and whatever is populating the table should validate e-mails to remove the need for the LEN check.

[EDIT] Added explanation to (3)

Upvotes: 1

Quassnoi

Reputation: 425341

Create the following indexes:

AllVrContacts (AccountID) INCLUDE (Email)
AllVrContacts (Email) INCLUDE (AccountID)
AllVrContacts (SECONDARY_EMAIL_ADDRESS__C) INCLUDE (AccountID)

The index on (AccountID, Email) will be used for the WHERE filter in the main query:

WHERE  ACCOUNTID = '00120000003bNmMAAU'
       AND LEN(Email) > 1

The other two indexes will be used for antijoins (NOT IN) against this table.

You should also use:

SELECT COUNT(DISTINCT email) AS UniqueRecords

if you want the duplicates across the same account to be counted only once.

Upvotes: 2

Can I make this T-SQL code more efficient

Answers (4)

Related Questions