chester89

Reputation: 8627

SQL Server index behaviour when doing bulk insert

I have an application that inserts multiple rows at once into SQL Server.

I use either the SqlBulkCopy class or hand-written code that generates a gigantic insert into table_name(...) values (...) statement.

My table has several indexes and a clustered one.

The question is: how are those indexes updated? For each row I insert? For each transaction?

Somewhat odd follow-up question: is there a general term for this scenario, like 'bulk-insert indexing behaviour'? I tried googling several keyword combinations but haven't found anything. The reason I ask is that I sometimes work with Postgres and would like to know its behaviour as well.

I've been trying to find an article on this topic, several times, without any luck.

If you can point me to any docs, article or a book with a relevant chapter, that'd be great.

Upvotes: 7

Views: 5066

Answers (3)

Dan Guzman

Reputation: 46233

You can see how indexes are updated by examining the query plan. Consider this heap table with only non-clustered indexes.

CREATE TABLE dbo.BulkInsertTest(
      Column1 int NOT NULL
    , Column2 int NOT NULL
    , Column3 int NOT NULL
    , Column4 int NOT NULL
    , Column5 int NOT NULL
    );
CREATE INDEX BulkInsertTest_Column1 ON dbo.BulkInsertTest(Column1);
CREATE INDEX BulkInsertTest_Column2 ON dbo.BulkInsertTest(Column2);
CREATE INDEX BulkInsertTest_Column3 ON dbo.BulkInsertTest(Column3);
CREATE INDEX BulkInsertTest_Column4 ON dbo.BulkInsertTest(Column4);
CREATE INDEX BulkInsertTest_Column5 ON dbo.BulkInsertTest(Column5);
GO

Below is the execution plan for a singleton INSERT.

INSERT INTO dbo.BulkInsertTest(Column1, Column2, Column3, Column4, Column5) VALUES
     (1, 2, 3, 4, 5);

INSERT execution plan

The execution plan shows only the Table Insert operator, so the new non-clustered index rows were inserted intrinsically during the table insert operation itself. A large batch of singleton INSERT statements will yield this same plan for each INSERT statement.
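One way to see this for yourself (assuming you can run the statement interactively, e.g. in SSMS) is to request the actual plan inline with the statement:

```sql
-- Return the actual execution plan XML alongside the statement's results
SET STATISTICS XML ON;

INSERT INTO dbo.BulkInsertTest(Column1, Column2, Column3, Column4, Column5)
VALUES (1, 2, 3, 4, 5);

SET STATISTICS XML OFF;
```

The plan XML returned as an extra result set can be clicked open in SSMS to show the graphical plan.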

I get a similar plan with a single INSERT statement that specifies a large number of rows via a row constructor, with the only difference being the addition of a Constant Scan operator to emit the rows.

INSERT INTO dbo.BulkInsertTest(Column1, Column2, Column3, Column4, Column5) VALUES
     (1, 2, 3, 4, 5)
    ,(1, 2, 3, 4, 5)
    ,(1, 2, 3, 4, 5)
    ,...
    ,(1, 2, 3, 4, 5);

Multi-row INSERT execution plan

Here's the execution plan for a T-SQL BULK INSERT statement (using a dummy empty file as the source). With the BULK INSERT, SQL Server added additional query plan operators to optimize the index inserts. The rows were spooled after being inserted into the table, and rows from the spool were then sorted and inserted into each index separately as a mass insert operation. This method reduces the overhead of large insert operations. You may also see similar plans for INSERT...SELECT queries.

BULK INSERT dbo.BulkInsertTest
    FROM 'c:\Temp\BulkInsertTest.txt';

BULK INSERT execution plan

I verified that SqlBulkCopy generates the same execution plan as a T-SQL BULK INSERT by capturing the actual plans with an Extended Event trace. Below is the trace DDL and PowerShell script I used.

Trace DDL:

CREATE EVENT SESSION [SqlBulkCopyTest] ON SERVER 
ADD EVENT sqlserver.query_post_execution_showplan(
    ACTION(sqlserver.client_app_name,sqlserver.sql_text)
    WHERE ([sqlserver].[equal_i_sql_unicode_string]([sqlserver].[client_app_name],N'SqlBulkCopyTest') 
        AND [sqlserver].[like_i_sql_unicode_string]([sqlserver].[sql_text],N'insert bulk%') 
        ))
ADD TARGET package0.event_file(SET filename=N'SqlBulkCopyTest');
GO
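To capture plans with this session, start it before running the bulk copy and stop it afterwards; the captured events can then be read back from the file target (session and file names as defined in the DDL above):

```sql
-- Start the session, run the SqlBulkCopy workload, then stop it
ALTER EVENT SESSION [SqlBulkCopyTest] ON SERVER STATE = START;
-- ... run the PowerShell script below ...
ALTER EVENT SESSION [SqlBulkCopyTest] ON SERVER STATE = STOP;

-- Read the captured events (including the showplan XML) from the file target
SELECT CAST(event_data AS xml) AS event_data
FROM sys.fn_xe_file_target_read_file(N'SqlBulkCopyTest*.xel', NULL, NULL, NULL);
```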

PowerShell script:

$connectionString = "Data Source=.;Initial Catalog=YourUserDatabase;Integrated Security=SSPI;Application Name=SqlBulkCopyTest"

$dt = New-Object System.Data.DataTable;
$null = $dt.Columns.Add("Column1", [System.Type]::GetType("System.Int32"))
$null = $dt.Columns.Add("Column2", [System.Type]::GetType("System.Int32"))
$null = $dt.Columns.Add("Column3", [System.Type]::GetType("System.Int32"))
$null = $dt.Columns.Add("Column4", [System.Type]::GetType("System.Int32"))
$null = $dt.Columns.Add("Column5", [System.Type]::GetType("System.Int32"))

$row = $dt.NewRow()
[void]$dt.Rows.Add($row)
$row["Column1"] = 1
$row["Column2"] = 2
$row["Column3"] = 3
$row["Column4"] = 4
$row["Column5"] = 5

$bcp = New-Object System.Data.SqlClient.SqlBulkCopy($connectionString)
$bcp.DestinationTableName = "dbo.BulkInsertTest"
$bcp.WriteToServer($dt)

EDIT

Credit to Vladimir Baranov for providing this blog article by Microsoft Data Platform MVP Paul White, which details SQL Server's cost-based index maintenance strategy.

EDIT 2

I see from your revised question that your actual situation is a table with a clustered index rather than a heap. The plans will be similar to the heap examples above, except, of course, that data will be inserted using a Clustered Index Insert operator instead of a Table Insert.

An ORDER hint may be specified for bulk insert operations into a table with a clustered index. When the specified order matches that of the clustered index, SQL Server can eliminate the Sort operator before the Clustered Index Insert, since it assumes the data are already sorted per the hint.
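As a sketch (assuming, hypothetically, that the clustered index is on Column1), the hint looks like this:

```sql
-- ORDER declares that the file rows already match the clustered index order,
-- letting SQL Server skip the Sort before the Clustered Index Insert
BULK INSERT dbo.BulkInsertTest
    FROM 'c:\Temp\BulkInsertTest.txt'
    WITH (ORDER (Column1 ASC));
```

If the data turn out not to be in the declared order, the bulk insert fails rather than silently producing a corrupt index.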

Unfortunately, System.Data.SqlClient.SqlBulkCopy does not expose an ORDER hint. As @benjol mentioned in the comments, the newer Microsoft.Data.SqlClient.SqlBulkCopy includes a ColumnOrderHints property where one can specify the target table's clustered index columns and sort order.
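A minimal sketch of that newer API, adapted from the PowerShell script above (it assumes the Microsoft.Data.SqlClient assembly is loaded into the session and that the clustered index is on Column1):

```powershell
# Assumes: Add-Type -Path <path to Microsoft.Data.SqlClient.dll> was run first,
# and $connectionString / $dt are set up as in the script above
$bcp = New-Object Microsoft.Data.SqlClient.SqlBulkCopy($connectionString)
$bcp.DestinationTableName = "dbo.BulkInsertTest"

# Declare that incoming rows are already sorted by the clustered index key
$hint = New-Object Microsoft.Data.SqlClient.SqlBulkCopyColumnOrderHint(
    "Column1", [Microsoft.Data.SqlClient.SortOrder]::Ascending)
[void]$bcp.ColumnOrderHints.Add($hint)

$bcp.WriteToServer($dt)
```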

Upvotes: 7

KumarHarsh

Reputation: 5094

My table has several indexes except clustered one

That means the table contains only non-clustered indexes, which also means the table is a heap.

When data is inserted (single rows or bulk), it is generally written at the end of the table, on the next available pages.

When data is deleted, space on the pages becomes free, but it is not reclaimed, because new data keeps being written at the end.

So there is a lot more fragmentation in a heap than in a table with a clustered index.

Since the table also has several non-clustered indexes, those indexes must be maintained as part of each insert.

Because indexes are ordered, index page splits will occur, and if wide columns such as varchar(100) or varchar(500) are indexed, index page splits will happen very frequently.

Bulk Insert with Clustered Index

Upvotes: 0

Jesús López

Reputation: 9241

The question is: how are those indexes updated? For each row I insert? For each transaction?

From a low-level point of view, indexes are always updated row by row; this is a consequence of the indexes' internal data structure. SQL Server indexes are B+ trees. There is no algorithm to update several rows in a B+ tree index all at once: you have to update them one by one, because you cannot know in advance where a row will go before updating or inserting the previous rows.

However, from a transactional point of view, indexes are updated all at once, because SQL Server implements transactional semantics. At the default isolation level, READ COMMITTED, another transaction cannot see the rows (index or table rows) you inserted in the bulk insert operation until the transaction is committed. So it appears as if the rows were inserted all at once.
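A quick way to observe this, as a hypothetical two-session sketch against the dbo.BulkInsertTest table from the first answer:

```sql
-- Session 1: insert inside an open transaction
BEGIN TRANSACTION;
INSERT INTO dbo.BulkInsertTest(Column1, Column2, Column3, Column4, Column5)
VALUES (1, 2, 3, 4, 5);

-- Session 2 (READ COMMITTED): this either blocks on the locked rows,
-- or, if READ_COMMITTED_SNAPSHOT is on, returns the pre-insert row count
SELECT COUNT(*) FROM dbo.BulkInsertTest;

-- Session 1: only after COMMIT does session 2 see the new rows
COMMIT;
```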

Upvotes: 0
