St0ffer

Reputation: 171

Bulk insert from CSV file - skip duplicates

UPDATE: I ended up using this method created by Johnny Bubriski and modified it a bit to skip duplicates. It works like a charm and is apparently quite fast. Link: http://johnnycode.com/2013/08/19/using-c-sharp-sqlbulkcopy-to-import-csv-data-sql-server/

I have been searching for an answer to this but cannot seem to find it. I am doing a T-SQL bulk insert to load data into a table in a local database from a csv file. My statement looks like this:

BULK INSERT Orders
FROM 'csvfile.csv'
WITH (
    FIELDTERMINATOR = ';',
    ROWTERMINATOR = '0x0a',
    FORMATFILE = 'formatfile.fmt',
    ERRORFILE = 'C:\ProgramData\Tools_TextileMagazine\AdditionalFiles\BulkInsertErrors.txt'
)
GO

SELECT * 
FROM Orders
GO

I get an exception when I try to insert duplicate rows (for example, loading the same csv file twice), which causes the entire insert to stop and roll back. That is understandable enough, since I am violating the primary key constraint. Right now I just show a message box to let users know that duplicates are present in the csv file, but that is of course not a proper solution, actually not a solution at all. My question is: is there any way to ignore these duplicate rows, skip them, and insert only the rows that are not duplicates? Perhaps in a try catch somehow?

If it is not possible, what would be the "correct" (for lack of a better word) way to import data from the csv file? The exception is causing me a bit of trouble. I did read somewhere that you can set up a temporary table, load the data into it, and then insert only the distinct rows that are not already in the target table. But is there really no easier way to do it with bulk insert?
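
If I understand that temporary table approach correctly, it would look roughly like this. The column names and the single-column primary key OrderId are only placeholders for my real table layout:

CREATE TABLE #OrdersStaging (OrderId INT, CustomerName NVARCHAR(100), Quantity INT)

BULK INSERT #OrdersStaging
FROM 'csvfile.csv'
WITH (FIELDTERMINATOR = ';', ROWTERMINATOR = '0x0a')

INSERT INTO Orders (OrderId, CustomerName, Quantity)
SELECT DISTINCT s.OrderId, s.CustomerName, s.Quantity
FROM #OrdersStaging AS s
WHERE NOT EXISTS (SELECT 1 FROM Orders AS o WHERE o.OrderId = s.OrderId)

DROP TABLE #OrdersStaging
GO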

Upvotes: 4

Views: 13507

Answers (2)

ralofpatel

Reputation: 59

First of all, there is no direct equivalent of BULK INSERT WHERE NOT EXISTS. You can use one of the following approaches.

When using BULK INSERT there are two scenarios:

  1. You are bulk inserting into an empty table
  2. You are bulk inserting into an already filled table

Solution for Case 1: set MAXERRORS = 0 and set BATCHSIZE = the total number of rows in the csv file.

With these options, the whole BULK INSERT operation rolls back if there is even a single error, so no rows are imported while any row in the file has a problem. You would need to resolve all import errors before the operation can complete. This prevents the situation where you import 50 rows, 30 get imported and the rest do not, and you then have to search the CSV file for the failed rows and re-import them, or delete all of the imported rows from the SQL table and run the BULK INSERT again.
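
As a sketch, this is what those settings look like applied to the statement from the question (50 is only a placeholder for the actual row count of the file):

BULK INSERT Orders
FROM 'csvfile.csv'
WITH (
    FIELDTERMINATOR = ';',
    ROWTERMINATOR = '0x0a',
    FORMATFILE = 'formatfile.fmt',
    MAXERRORS = 0,   -- abort and roll back on the first error
    BATCHSIZE = 50   -- total number of rows in the csv file
)
GO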

Solution for Case 2:

1> Run a SELECT query on the existing table, right-click the results and export them to CSV. If you have a spreadsheet program, paste that data below the data you want to import, use conditional formatting on the primary key column to highlight the duplicate rows, and delete them. Then run the BULK INSERT operation.

2> Set MAXERRORS = the number of rows and import the csv file using BULK INSERT. This is not a safe or recommended approach, as there might be other kinds of errors in the file apart from duplicate key errors.

3> Set BATCHSIZE = 1 and MAXERRORS = a high number, and import the csv file using BULK INSERT. This imports all of the rows without errors, and any row with an error is skipped. It is useful if the data set is small and you can visually identify the rows that were not imported, for example by looking at a sequential id column and spotting the missing numbers.

4> Right-click the existing table and choose Script Table as > CREATE To > New Query Editor Window. Rename the table in the script to a staging name such as table_staging. BULK INSERT into the staging table, then run a second query that copies the data from the staging table to the main table, using a WHERE clause to check whether the row/pk already exists (see the sketch below this list). This is a much safer way, but it forces you to create a staging table.
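
A minimal sketch of option 4, assuming a staging table named Orders_staging with the same layout as Orders and a single-column primary key named OrderId (both are placeholders for your real schema):

BULK INSERT Orders_staging
FROM 'csvfile.csv'
WITH (FIELDTERMINATOR = ';', ROWTERMINATOR = '0x0a', FORMATFILE = 'formatfile.fmt')

INSERT INTO Orders (OrderId, CustomerName, Quantity)
SELECT s.OrderId, s.CustomerName, s.Quantity
FROM Orders_staging AS s
WHERE NOT EXISTS (SELECT 1 FROM Orders AS o WHERE o.OrderId = s.OrderId)

TRUNCATE TABLE Orders_staging
GO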

Upvotes: 1

DoctorMick

Reputation: 6793

You could set the MAXERRORS property to quite a high value, which will allow the valid records to be inserted and the duplicates to be ignored. Unfortunately, this means that any other errors in the dataset won't cause the load to fail either.

Alternatively, you could set the BATCHSIZE property, which loads the data in multiple transactions, so if there are duplicates only the affected batch is rolled back.
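
For example, something along these lines (the values are only illustrative):

BULK INSERT Orders
FROM 'csvfile.csv'
WITH (
    FIELDTERMINATOR = ';',
    ROWTERMINATOR = '0x0a',
    FORMATFILE = 'formatfile.fmt',
    MAXERRORS = 1000,   -- tolerate a large number of failed rows
    BATCHSIZE = 100     -- commit in batches so a failure only rolls back that batch
)
GO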

A safer, but less efficient, way would be to load the CSV file into a separate, empty table and then merge it into your orders table, as you mentioned. Personally, this is the way I'd do it.
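
A sketch of that, assuming the CSV has already been bulk inserted into an empty staging table named Orders_staging and that the key column is called OrderId (both placeholders, as are the other column names):

MERGE Orders AS target
USING Orders_staging AS source
    ON target.OrderId = source.OrderId
WHEN NOT MATCHED BY TARGET THEN
    INSERT (OrderId, CustomerName, Quantity)
    VALUES (source.OrderId, source.CustomerName, source.Quantity);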

None of these solutions are ideal but I can't think of a way of ignoring duplicates in the bulk insert syntax.

Upvotes: 3
