Serdar
Serdar

Reputation: 797

Performance - Select query with left join and null check

I have two different tables which are called as Processing (30M records for now) and EtlRecord (4.3M records for now). As the name of tables suggest, these tables will be used for normalization of data with ETL.

We are trying to process records with batches where we have 1000 records in each batch.

SELECT TOP 1000 P.StreamGuid
FROM [staging].[Processing] P           (NOLOCK)
LEFT JOIN [core].[EtlRecord] E          (NOLOCK)    ON E.StreamGuid = P.StreamGuid
WHERE E.StreamGuid IS NULL
AND P.CompleteDate IS NOT NULL
AND P.StreamGuid IS NOT NULL

Execution of this query takes around 20 seconds now. And we are expecting to have more and more data especially in EtlRecord table. To be able to improve the performance of this query I check the actual execution plan which I shared below.

Execution plan

As you can see, the most time consuming part is index seek to determine null records in EtlRecord table. I have tried several changes but couldn't able to improve it.

Additional notes

Any suggestions to improve this query will be really helpful.

Upvotes: 0

Views: 1985

Answers (3)

Gordon Linoff
Gordon Linoff

Reputation: 1269503

I would suggest writing this as:

SELECT TOP 1000 P.StreamGuid
FROM [staging].[Processing] P
WHERE P.CompleteDate IS NOT NULL AND
      P.StreamGuid IS NOT NULL AND
      NOT EXISTS (SELECT 1
                  FROM [core].[EtlRecord] E 
                  WHERE E.StreamGuid = P.StreamGuid
                 );

I removed the NOLOCK directive. Only use it if you really know what you are doing -- and are prepared to read invalid data.

Then you definitely want an index on EtlRecord(StreamGuid).

You probably also want an index on Processing(CompleteDate, StreamGuid). This is at least a covering index for the query.

Upvotes: 0

Alan Burstein
Alan Burstein

Reputation: 7918

First, DDL and easily consumable sample data, like below, will help a great deal. You can copy/paste my solutions and run them locally to see what I'm talking about.

IF OBJECT_ID('tempdb..#processing','U') IS NOT NULL DROP TABLE #processing;
IF OBJECT_ID('tempdb..#EtlRecord','U')  IS NOT NULL DROP TABLE #EtlRecord;

SELECT TOP (100) 
  StreamGuid   = NEWID(),
  CompleteDate = CASE WHEN CHECKSUM(NEWID())%3 < 2 THEN GETDATE() END
INTO #processing
FROM sys.all_columns AS a

SELECT TOP (80) p.StreamGuid
INTO   #EtlRecord
FROM   #Processing AS p;

ALTER TABLE #processing ALTER COLUMN StreamGuid UNIQUEIDENTIFIER NOT NULL;
ALTER TABLE #EtlRecord  ALTER COLUMN StreamGuid UNIQUEIDENTIFIER NOT NULL;
GO
ALTER TABLE #processing ADD CONSTRAINT pk_processing PRIMARY KEY CLUSTERED(StreamGuid);
ALTER TABLE #etlRecord  ADD CONSTRAINT pk_etlRecord  PRIMARY KEY CLUSTERED(StreamGuid);
GO

Next understand that, without an ORDER BY clause, your query is not guaranteed to return the same records each time. For example, if SQL Server picks a parallel execution plan you will definitely get a different rows. I have also seen cases where including the ORDER BY will actually improve performance.

With that in mind, not that this...

SELECT --TOP 1000 
  P.StreamGuid
FROM      #processing AS p
LEFT JOIN #etlRecord  AS e ON e.StreamGuid = p.StreamGuid
WHERE     e.StreamGuid   IS NOT NULL
AND       P.CompleteDate IS NOT NULL

... will return the exact same thing as this:

SELECT TOP 1000 
  P.StreamGuid
FROM      #processing AS p
JOIN      #etlRecord  AS e ON e.StreamGuid = p.StreamGuid
WHERE     p.CompleteDate IS NOT NULL;

note that WHERE e.StreamGuid = p.StreamGuid already implies that both values are NOT NULL. Note that this query...

DECLARE @X INT;
SELECT  AreTheyEqual = IIF(@X=@X,'Yep','Nope');

... returns:

AreTheyEqual
------------
Nope

I agree with the solution @gotqn posted about the filtered index. Using my sample data, you can add something like this:

CREATE NONCLUSTERED INDEX nc_processing ON #processing(CompleteDate,StreamGuid)
    WHERE CompleteDate IS NOT NULL;

Then you can add an ORDER BY CompleteDate to the query to coerce the optimizer into choosing it that index (on my system it doesn't pick the index unless I add an ORDER BY). The ORDER BY will make you query deterministic and more predictable.

Upvotes: 1

gotqn
gotqn

Reputation: 43636

Well, in your query you need to get records from [staging].[Processing] which has not got corresponding record in the [core].[EtlRecord].

You can remove the proceeded records, first.

DELETE [staging].[Processing]
FROM [staging].[Processing] P           
INNER JOIN [core].[EtlRecord] E
    ON E.StreamGuid = P.StreamGuid;

You can use deletion on batches if you need. Removing this records will simplify our initial query and the nasty join by uniqueidentifier. You simply need to do then something like this for each batch:

SELECT TOP 1000 StreamGuid
INTO #buffer
FROM [staging].[Processing]
WHERE CompleteDate IS NOT NULL
    AND StreamGuid IS NOT NULL;

-- do whatevery you need with this records 

DELETE FROM [staging].[Processing]
WHERE StreamGuid IN (SELECT StreamGuid FROM #buffer);

Also, you have said that you have all indexes created but indexes suggested by the execution plan are not always best. This part here:

WHERE CompleteDate IS NOT NULL
    AND StreamGuid IS NOT NULL;

seems like very good candidate for filtered index especially if large amount of the rows has a NULL value for one of this columns.

Upvotes: 1

Related Questions