Andy
Andy

Reputation: 1293

How to speed up SQL query with JOIN on large varchar field and a NOT EXISTS

I have this query that's taking forever to run. The table contains about 7 million rows. Everything else I'm doing with it (it's a "temporary" permanent table) is going relatively quickly (an hour or so) while this one UPDATE belong took 7 hours! We have SQL Server 2014.

DOI is an NVARCHAR(72) and has a non-unique CLUSTERED index on it. Affiliations is a VARCHAR(8000). I'm not really allowed to change these data types. Affiliations has an index on it as an include. We couldn't do a "regular" index since the field is so big.

CREATE NONCLUSTERED INDEX IX_Affiliations 
    ON TempSourceTable (DOI) INCLUDE (Affiliations);

What the statement below does is set a bit field to 1 if all the records for a DOI have the same value in their Affiliations column. This table has multiple records per DOI value, and we want to know if the Affiliations column is the same for all of the records with that same DOI or not.

Is there any way I can speed this up, by writing a different query, a different index or am I going about this all wrong?

UPDATE S
SET AffiliationsSameForAllDOI = 1
FROM TempSourceTable S
WHERE NOT EXISTS (SELECT 1 
                  FROM TempSourceTable S2 
                  WHERE S2.DOI = S.DOI 
                    AND S2.Affiliations <> S.Affiliations)

Upvotes: 1

Views: 2468

Answers (3)

dsz
dsz

Reputation: 5202

I'd hope this to perform better than the other offerings, since it should execute in a single scan over the index. Also, the min/max 'trick' avoids having to collect and maintain each distinct value.

WITH X AS 
(
  SELECT *,
    AffiliationsSameForAllDOI_New =
      CASE WHEN MAX(Affiliations) OVER (PARTITION BY DOI)
        = MIN(Affiliations) OVER (PARTITION BY DOI)
      THEN 1
      ELSE 0
      END
  FROM TempSourceTable 
)
UPDATE X
SET AffiliationsSameForAllDOI = AffiliationsSameForAllDOI_New
WHERE AffiliationsSameForAllDOI_New = 1

Upvotes: 0

Pரதீப்
Pரதீப்

Reputation: 93724

Here is another way

SUB-QUERY version

UPDATE TempSourceTable
SET    AffiliationsSameForAllDOI = 1
WHERE  doi IN (SELECT doi
               FROM   TempSourceTable S
               GROUP  BY DOI
               HAVING COUNT(DISTINCT Affiliations) = 1) 

EXISTS Version

UPDATE TempSourceTable S
SET    AffiliationsSameForAllDOI = 1
WHERE EXISTS  (SELECT 1
               FROM   TempSourceTable S1
               Where s1.DOI = s.DOI
               HAVING COUNT(DISTINCT Affiliations) = 1) 

INNER JOIN Version

UPDATE S 
SET    AffiliationsSameForAllDOI = 1
FROM TempSourceTable S
INNER JOIN (SELECT doi
            FROM   TempSourceTable 
            GROUP  BY DOI
            HAVING COUNT(DISTINCT Affiliations) = 1) S1
        ON S.DOI = S1.DOI

Upvotes: 6

shawnt00
shawnt00

Reputation: 17915

update TempSourceTable
set AffiliationsSameForAllDOI = 1
where DOI in (
    select DOI
    from TempSourceTable
    group by DOI
    having count(distinct Affiliations) = 1
)

Depending on what your data looks like maybe you'd have some luck with performance by creating a computed column that strips out say the first 16 characters from Affiliations or just using the checksum() and then indexing on that column instead. Perhaps it would look something like this:

update TempSourceTable
set AffiliationsSameForAllDOI = 1
where DOI in (
    select DOI
    from TempSourceTable
    where DOI in (
        select DOI
        from TempSourceTable
        group by DOI
        having count(distinct AffiliationsChecksum) = 1
    )
    group by DOI
    having count(distinct Affiliations) = 1
)

Upvotes: 4

Related Questions