St0ffer

Reputation: 171

Bulk insert from CSV file - skip duplicates

UPDATE: I ended up using this method created by Johnny Bubriski and modified it a bit to skip duplicates. It works like a charm and is apparently quite fast. Link: http://johnnycode.com/2013/08/19/using-c-sharp-sqlbulkcopy-to-import-csv-data-sql-server/

I have been searching for an answer to this but cannot seem to find it. I am doing a T-SQL bulk insert to load data into a table in a local database from a csv file. My statement looks like this:

BULK INSERT Orders
FROM 'csvfile.csv'
WITH (
    FIELDTERMINATOR = ';',
    ROWTERMINATOR = '0x0a',
    FORMATFILE = 'formatfile.fmt',
    ERRORFILE = 'C:\ProgramData\Tools_TextileMagazine\AdditionalFiles\BulkInsertErrors.txt'
)
GO

SELECT * 
FROM Orders
GO

I get an exception when I try to insert duplicate rows (for example, loading the same csv file twice), which causes the entire insert to stop and roll back. That is understandable enough, since I am violating the primary key constraint. Right now I just show a message box to let users know that duplicates are present in the csv file, but that is of course not a proper solution, actually not a solution at all. My question is: is there any way to ignore these duplicate rows, skip them, and insert only the rows that are not duplicates? Perhaps in a try catch somehow?

If it is not possible, what would be the "correct" (for lack of a better word) way to import data from the csv file? The exception is causing me a bit of trouble. I did read somewhere that you can set up a temporary table, load the data into it, and then insert only the distinct rows that are not already in the target table. But is there really no easier way to do it with bulk insert?
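
If I understand that temporary table approach correctly, it would look roughly like this. The column names and the single-column primary key OrderId are only placeholders for my real table layout:

CREATE TABLE #OrdersStaging (OrderId INT, CustomerName NVARCHAR(100), Quantity INT)

BULK INSERT #OrdersStaging
FROM 'csvfile.csv'
WITH (FIELDTERMINATOR = ';', ROWTERMINATOR = '0x0a')

INSERT INTO Orders (OrderId, CustomerName, Quantity)
SELECT DISTINCT s.OrderId, s.CustomerName, s.Quantity
FROM #OrdersStaging AS s
WHERE NOT EXISTS (SELECT 1 FROM Orders AS o WHERE o.OrderId = s.OrderId)

DROP TABLE #OrdersStaging
GO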

Upvotes: 4

Views: 13507

Answers (2)

ralofpatel

Reputation: 59

First of all, there is no direct equivalent of BULK INSERT WHERE NOT EXISTS. You can use one of the following approaches.

When using BULK INSERT there are two scenarios:

  1. You are bulk inserting into an empty table
  2. You are bulk inserting into an already filled table

Solution for Case 1: set MAXERRORS = 0 and set BATCHSIZE = the total number of rows in the csv file.

With these options, the whole BULK INSERT operation rolls back if there is even a single error, so no rows are imported while any row in the file has a problem. You would need to resolve all import errors before the operation can complete. This prevents the situation where you import 50 rows, 30 get imported and the rest do not, and you then have to search the CSV file for the failed rows and re-import them, or delete all of the imported rows from the SQL table and run the BULK INSERT again.
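
As a sketch, this is what those settings look like applied to the statement from the question (50 is only a placeholder for the actual row count of the file):

BULK INSERT Orders
FROM 'csvfile.csv'
WITH (
    FIELDTERMINATOR = ';',
    ROWTERMINATOR = '0x0a',
    FORMATFILE = 'formatfile.fmt',
    MAXERRORS = 0,   -- abort and roll back on the first error
    BATCHSIZE = 50   -- total number of rows in the csv file
)
GO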

Solution for Case 2:

1> Run a SELECT query on the existing table, right-click the results and export them to CSV. If you have a spreadsheet program, paste that data below the data you want to import, use conditional formatting on the primary key column to highlight the duplicate rows, and delete them. Then run the BULK INSERT operation.

2> Set MAXERRORS = the number of rows and import the csv file using BULK INSERT. This is not a safe or recommended approach, as there might be other kinds of errors in the file apart from duplicate key errors.

3> Set BATCHSIZE = 1 and MAXERRORS = a high number, and import the csv file using BULK INSERT. This imports all of the rows without errors, and any row with an error is skipped. It is useful if the data set is small and you can visually identify the rows that were not imported, for example by looking at a sequential id column and spotting the missing numbers.

4> Right-click the existing table and choose Script Table as > CREATE To > New Query Editor Window. Rename the table in the script to a staging name such as table_staging. BULK INSERT into the staging table, then run a second query that copies the data from the staging table to the main table, using a WHERE clause to check whether the row/pk already exists (see the sketch below this list). This is a much safer way, but it forces you to create a staging table.
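
A minimal sketch of option 4, assuming a staging table named Orders_staging with the same layout as Orders and a single-column primary key named OrderId (both are placeholders for your real schema):

BULK INSERT Orders_staging
FROM 'csvfile.csv'
WITH (FIELDTERMINATOR = ';', ROWTERMINATOR = '0x0a', FORMATFILE = 'formatfile.fmt')

INSERT INTO Orders (OrderId, CustomerName, Quantity)
SELECT s.OrderId, s.CustomerName, s.Quantity
FROM Orders_staging AS s
WHERE NOT EXISTS (SELECT 1 FROM Orders AS o WHERE o.OrderId = s.OrderId)

TRUNCATE TABLE Orders_staging
GO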

Upvotes: 1

DoctorMick

Reputation: 6793

You could set the MAXERRORS property to quite a high value, which will allow the valid records to be inserted and the duplicates to be ignored. Unfortunately, this means that any other errors in the dataset won't cause the load to fail either.

Alternatively, you could set the BATCHSIZE property, which loads the data in multiple transactions, so if there are duplicates only the affected batch is rolled back.
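
For example, something along these lines (the values are only illustrative):

BULK INSERT Orders
FROM 'csvfile.csv'
WITH (
    FIELDTERMINATOR = ';',
    ROWTERMINATOR = '0x0a',
    FORMATFILE = 'formatfile.fmt',
    MAXERRORS = 1000,   -- tolerate a large number of failed rows
    BATCHSIZE = 100     -- commit in batches so a failure only rolls back that batch
)
GO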

A safer, but less efficient, way would be to load the CSV file into a separate, empty table and then merge it into your orders table, as you mentioned. Personally, this is the way I'd do it.
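
A sketch of that, assuming the CSV has already been bulk inserted into an empty staging table named Orders_staging and that the key column is called OrderId (both placeholders, as are the other column names):

MERGE Orders AS target
USING Orders_staging AS source
    ON target.OrderId = source.OrderId
WHEN NOT MATCHED BY TARGET THEN
    INSERT (OrderId, CustomerName, Quantity)
    VALUES (source.OrderId, source.CustomerName, source.Quantity);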

None of these solutions are ideal but I can't think of a way of ignoring duplicates in the bulk insert syntax.

Upvotes: 3
