cs0815

Set-based bulk import of denormalized data into normalized SQL Server 2014 database tables

The following simplified mock-up works fine for a set-based bulk insert of the denormalised data in #BulkData (improvement suggestions welcome):

IF OBJECT_ID('tempdb..#Things') IS NOT NULL 
   DROP TABLE #Things

IF OBJECT_ID('tempdb..#Categories') IS NOT NULL 
   DROP TABLE #Categories

IF OBJECT_ID('tempdb..#ThingsToCategories') IS NOT NULL 
   DROP TABLE #ThingsToCategories

IF OBJECT_ID('tempdb..#BulkData') IS NOT NULL 
   DROP TABLE #BulkData

CREATE TABLE #Things
(
    ThingId INT IDENTITY(1,1) PRIMARY KEY,
    ThingName NVARCHAR(255)
)

CREATE TABLE #Categories
(
    CategoryId INT IDENTITY(1,1) PRIMARY KEY,
    CategoryName NVARCHAR(255)
)

CREATE TABLE #ThingsToCategories
(
    ThingId INT,
    CategoryId INT
)

CREATE TABLE #BulkData
(
    ThingName NVARCHAR(255),
    CategoryName NVARCHAR(255)
)

-- the following would be done from a flat file via a bulk import
INSERT INTO #BulkData (ThingName, CategoryName)
VALUES (N'Thing1', N'Category1'),
       (N'Thing2', N'Category1'),
       (N'Thing3', N'Category2')

-- add categories not already present (NOT IN is safe here only while
-- CategoryName is never NULL; NOT EXISTS is the more robust pattern)
INSERT INTO #Categories
    SELECT DISTINCT CategoryName
    FROM #BulkData
    WHERE CategoryName NOT IN (SELECT CategoryName FROM #Categories)

-- add things not already present
INSERT INTO #Things
    SELECT DISTINCT ThingName
    FROM #BulkData
    WHERE ThingName NOT IN (SELECT ThingName FROM #Things)

-- resolve the generated surrogate keys and fill the link table
INSERT INTO #ThingsToCategories
    SELECT ThingId, CategoryId
    FROM #BulkData
    INNER JOIN #Things ON #BulkData.ThingName = #Things.ThingName
    INNER JOIN #Categories ON #BulkData.CategoryName = #Categories.CategoryName

SELECT * FROM #Categories
SELECT * FROM #Things
SELECT * FROM #ThingsToCategories
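
As an aside, the flat-file load hinted at by the comment above might use BULK INSERT; a minimal sketch, where the file path, delimiters and header row are hypothetical:

BULK INSERT #BulkData
FROM 'C:\import\bulkdata.csv'      -- hypothetical path on the server
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2            -- skip a header row, if the file has one
);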

One issue I have with the above is that data in #Things would be accessible before the corresponding rows are inserted into #ThingsToCategories.

Can I wrap the above in a transaction so that #Things only becomes visible once the whole bulk import has finished?

Like so:

BEGIN TRANSACTION X
 -- insert into all normalised tables
COMMIT TRANSACTION X

Does this work with a couple of million records though?

I guess one could also lower the logging level?
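
(In SQL Server terms, "lowering the logging level" would mean switching the target database's recovery model; a sketch with a hypothetical database name. Note that the temp tables above live in tempdb, which always uses the SIMPLE recovery model:)

-- BULK_LOGGED minimally logs bulk operations such as BULK INSERT and
-- SELECT INTO, at the cost of point-in-time recovery for that window
ALTER DATABASE ImportDb SET RECOVERY BULK_LOGGED;   -- hypothetical name

-- ... run the bulk import ...

ALTER DATABASE ImportDb SET RECOVERY FULL;          -- restore full logging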


Answers (1)

TT.

  1. Can I wrap the above in a transaction so that #Things only becomes visible once the whole bulk import has finished? Like so:

BEGIN TRANSACTION X
 -- insert into all normalised tables
COMMIT TRANSACTION X

The answer is yes. From the SQL Server documentation on Transactions:

A transaction is a single unit of work. If a transaction is successful, all of the data modifications made during the transaction are committed and become a permanent part of the database. If a transaction encounters errors and must be canceled or rolled back, then all of the data modifications are erased.
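
Applied to the mock-up, a minimal sketch of such a wrapper might look as follows (SET XACT_ABORT ON and the TRY/CATCH error handling are additions, not part of the question's code):

SET XACT_ABORT ON;  -- any run-time error dooms and rolls back the transaction

BEGIN TRY
    BEGIN TRANSACTION;

    INSERT INTO #Categories
        SELECT DISTINCT CategoryName FROM #BulkData
        WHERE CategoryName NOT IN (SELECT CategoryName FROM #Categories);

    INSERT INTO #Things
        SELECT DISTINCT ThingName FROM #BulkData
        WHERE ThingName NOT IN (SELECT ThingName FROM #Things);

    INSERT INTO #ThingsToCategories
        SELECT ThingId, CategoryId
        FROM #BulkData
        INNER JOIN #Things ON #BulkData.ThingName = #Things.ThingName
        INNER JOIN #Categories ON #BulkData.CategoryName = #Categories.CategoryName;

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    THROW;  -- re-raise the original error to the caller
END CATCH;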

Transactions have the following four standard properties, usually referred to by the acronym ACID. Quoting from tutorialspoint.com's page on SQL transactions:

Atomicity: ensures that all operations within the work unit are completed successfully; otherwise, the transaction is aborted at the point of failure, and previous operations are rolled back to their former state.

Consistency: ensures that the database properly changes states upon a successfully committed transaction.

Isolation: enables transactions to operate independently of and transparent to each other.

Durability: ensures that the result or effect of a committed transaction persists in case of a system failure.


  2. Will this work with a few million entries?

Again, yes. The number of entries is irrelevant. In my own words this time:

  • Atomicity: If a transaction succeeds, all of its operations take effect at the moment the transaction is committed. If at least one operation fails, all operations are rolled back (in other words, none take hold). The number of operations within the transaction is irrelevant.

  • Isolation: Other transactions won't see a transaction's operations until that transaction is committed.

There are, however, different transaction isolation levels. The default for SQL Server is READ COMMITTED:

Specifies that statements cannot read data that has been modified but not committed by other transactions. [...]

READ COMMITTED is a trade-off that balances performance against consistency. Ideally you would want everything SERIALIZABLE (see the documentation; too long to copy/paste here), but that level trades performance (-) for consistency (+). In a lot of cases READ COMMITTED is good enough, but you should be aware of how it works and weigh that against how your transaction needs to behave with respect to other concurrent transactions.
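
For reference, the isolation level is set per session; a small sketch using SERIALIZABLE (substitute READ COMMITTED or another level as needed):

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

BEGIN TRANSACTION;
    -- under SERIALIZABLE, the rows (and ranges) read here stay locked:
    -- other transactions cannot change them until this one commits
    SELECT COUNT(*) FROM #ThingsToCategories;
COMMIT TRANSACTION;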

Note also that a transaction will put locks on database objects (rows, tables, schemas...) and that other transactions will block if they want to read or modify those objects (depending on the type of lock). For that reason it is preferable to keep the number of operations within a transaction low. Sometimes, though, a transaction simply has to do a lot of work and cannot be broken up.
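
Those locks can be observed while a transaction is open; a sketch using the sys.dm_tran_locks DMV (the UPDATE is purely illustrative):

BEGIN TRANSACTION;

    -- takes exclusive locks that are held until the transaction ends
    UPDATE #Things SET ThingName = ThingName WHERE ThingId = 1;

    -- list the locks held by the current session
    SELECT resource_type, request_mode, request_status
    FROM sys.dm_tran_locks
    WHERE request_session_id = @@SPID;

COMMIT TRANSACTION;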
