Matt

Reputation: 26971

Indexing to check whether a very large table contains an exact string in a certain column

I have a very large SQL table (~500 million rows)

CREATE TABLE [dbo].[TestDefinition]
(
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [Name] [varchar](500) NOT NULL,
    [TeamId] [int] NOT NULL,

    CONSTRAINT [PK_Test] 
        PRIMARY KEY CLUSTERED ([Id] ASC)
) ON [PRIMARY]
GO

I want to insert new records in the table but first I need to make sure the record doesn't already exist.

I essentially build a table with the values I want to insert (@TestDefinitionInput, a table-valued parameter) and then pass it into a stored procedure.

WITH W AS 
(
   SELECT [Name], TeamId
   FROM @TestDefinitionInput
),
X AS 
(
   SELECT W.* 
   FROM W
   LEFT JOIN TestDefinition td ON td.TeamId = W.TeamId AND td.[Name] = W.[Name] 
   WHERE td.Id IS NULL
)
INSERT INTO TestDefinition ([Name], [TeamId]) 
   SELECT [Name], [TeamId]
   FROM X;

My problem is that this is really slow against 500 million rows. I'm not very proficient at SQL and am wondering how (or whether) I should index dbo.[TestDefinition] to make this as fast as possible.

Upvotes: 1

Views: 38

Answers (1)

Mitch Wheat

Reputation: 300549

You have a couple of possibilities:

You could consider changing your primary key to:

CONSTRAINT [PK_Test] 
    PRIMARY KEY CLUSTERED (TeamId ASC, [Id] ASC)

At the expense of widening the clustered index attached to the primary key, you won't burn a separate index for range searches on TeamId. This would require periodic defragmentation, since inserts would no longer be strictly ascending.

The other (probably preferred) option is to create a non-clustered index:

CREATE INDEX IX_TestDefinition_TeamId_Name 
    ON dbo.TestDefinition(TeamId, Name)

[Note: there is no need to explicitly INCLUDE Id, as it's the clustering key and is automatically added to every non-clustered index.]

I would also consider placing the incoming TVP parameter values into a temporary table rather than a table variable, and join to that table (even creating a corresponding index on it). Table vars are notorious for poor cardinality estimates.
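A minimal sketch of that staging approach might look like the following (the temp table and index names are illustrative; @TestDefinitionInput is the TVP from the question):

```sql
-- Sketch: copy the TVP into a temp table so the optimizer gets
-- real statistics instead of a table variable's fixed estimate.
CREATE TABLE #TestDefinitionInput
(
    [Name]   varchar(500) NOT NULL,
    [TeamId] int NOT NULL
);

INSERT INTO #TestDefinitionInput ([Name], [TeamId])
    SELECT [Name], [TeamId]
    FROM @TestDefinitionInput;

-- Index matching the join predicate (TeamId, Name)
CREATE INDEX IX_Temp_TeamId_Name
    ON #TestDefinitionInput (TeamId, [Name]);

-- Anti-join against the big table, then insert only the missing rows
INSERT INTO dbo.TestDefinition ([Name], [TeamId])
    SELECT i.[Name], i.[TeamId]
    FROM #TestDefinitionInput i
    LEFT JOIN dbo.TestDefinition td
        ON td.TeamId = i.TeamId
       AND td.[Name] = i.[Name]
    WHERE td.Id IS NULL;
```

With the non-clustered index on dbo.TestDefinition(TeamId, Name) in place, that anti-join becomes an index seek per incoming row rather than a scan of the 500-million-row table.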

I would also consider adding the index with page compression (and the table too, if you can):

.... with (data_compression = page);

Data compression is not available in every edition of SQL Server; from SQL Server 2016 SP1 onwards, it is available in Standard Edition.

The other thing you should consider is performing the inserts in batches of, say, 50K or 100K rows at a time, with a short sleep in between. This can prevent massive log file growth and contention, and gives other processes a chance to access the table.
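A rough sketch of that batching pattern (assuming the incoming rows have already been staged in a temp table such as #TestDefinitionInput; names and batch size are illustrative):

```sql
-- Sketch: insert in batches with a short pause between them.
DECLARE @BatchSize int = 50000;

WHILE 1 = 1
BEGIN
    -- Each pass inserts up to @BatchSize rows that are not yet present;
    -- rows inserted by earlier passes are filtered out by the anti-join.
    INSERT INTO dbo.TestDefinition ([Name], [TeamId])
        SELECT TOP (@BatchSize) i.[Name], i.[TeamId]
        FROM #TestDefinitionInput i
        LEFT JOIN dbo.TestDefinition td
            ON td.TeamId = i.TeamId
           AND td.[Name] = i.[Name]
        WHERE td.Id IS NULL;

    IF @@ROWCOUNT < @BatchSize
        BREAK;   -- last (partial or empty) batch done

    WAITFOR DELAY '00:00:01';  -- brief pause to ease log and lock pressure
END
```

Each batch is its own implicit transaction, so the log can truncate (or back up) between batches instead of holding one giant transaction open.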

Upvotes: 2
