Reputation: 2017
The TLDR version: Bulk Insert will tell you how many rows it affected. It will NOT tell you how many rows it TRIED to affect, or how many failed. The problem with this is intuitively obvious, and I want to know if there is a more reliable way of uploading from text files, keeping the code inside the server.
The Full Version: I have an application which needs to periodically upload text data files into SQL Server tables. For some wild and crazy reason I thought to put this in a stored procedure to make it part of an abstraction layer rather than have the front end application write directly to the tables.
As with most SQL Server scripting I spent my usual share of time bashing my head against a brick wall to get it to work at all. (With no small amount of help from searching past posts on this site and others.)
Will Bulk Insert read a header row to determine which fields to write to? Nope, I have to either use a Format File (and hope that the table structure / order never changes) or use a staging table containing only the columns in the data file. The staging table receives the data prior to copying it to the real table.
What if you omit a single column in the target table (even one which has a default) and don't use a Format File? You get the oh-so-self-explanatory error message "Cannot fetch a row from OLE DB provider "BULK" for linked server "(null)"." Using the aforementioned staging table, omitting any columns which aren't in the data source, gets around this.
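To illustrate, here is a minimal sketch of that staging-table workaround, assuming a hypothetical tab-delimited file with three columns; the table name, column names, file path, and delimiters are placeholders and would need to match the actual data file:

    -- Staging table containing only the columns present in the data file,
    -- with loose types so the load itself cannot fail on conversion
    CREATE TABLE dbo.ImportStaging
    (
        CustomerName varchar(100) NULL,
        OrderDate    varchar(30)  NULL,
        Amount       varchar(30)  NULL
    );

    -- Load the raw file, skipping the header row
    BULK INSERT dbo.ImportStaging
    FROM 'C:\Imports\data.txt'
    WITH
    (
        FIRSTROW        = 2,      -- skip the header line
        FIELDTERMINATOR = '\t',
        ROWTERMINATOR   = '\n'
    );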
That's fine, I can live with that. It's not the scary part. This is.
If the staging table still has any fields which are defined as NOT NULL, but a data source row is null for that column, you do not get an error. Based on my tests, if you have (say) 5 rows of data and row 3 is missing data in a NOT NULL field, then you won't get an error but will get the message "4 rows updated". Which is all fine and dandy if you're expecting 5 rows and do a cross-check against the number of rows affected to make sure that the expected number is there, but these files will be of varying lengths and Bulk Insert will not tell you how many rows it actually read. Worse, in some cases a missing field in one row will prevent the following (valid) row from uploading as well.
The obvious solution? Remove the NOT NULL constraints from the staging table and handle any null exceptions in the interface between the staging table and the real table. However... my concern is that this piece of cr...ode may do the same thing in other circumstances that I haven't come across yet. That is, read a row, fail to write it to the staging table, and not throw an exception so that nobody knows that data is missing until they go looking for it and find that it just ain't there. Even Access has better text import options than this.
My question, then, is... is there a better (more reliable) way of handling uploads of variable row length text files to SQL Server without having to rely on the front end app to do it?
Thanks in advance for any advice.
Upvotes: 0
Views: 602
Reputation: 89711
I would still use SSIS for this, even though handling ragged files with it is difficult. It gives you a lot of control and good performance.
Upvotes: 0
Reputation: 11396
.NET's SqlBulkCopy would give you more control when transferring data to the staging table, and still provide great performance.
Now, considering you want all the logic on the server itself, the following strategy might work for you:
Keep the staging table without any restrictions that could make your process fail. Add extra columns to it, such as an IsValid flag plus an ErrorMessage column.
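A rough sketch of what that table might look like (the data columns and names are placeholders, and every data column is nullable so the bulk load itself can never be rejected):

    CREATE TABLE dbo.ImportStaging
    (
        CustomerName varchar(100) NULL,
        OrderDate    varchar(30)  NULL,
        Amount       varchar(30)  NULL,
        IsValid      bit          NOT NULL DEFAULT (1),  -- flipped to 0 by the validation passes
        ErrorMessage varchar(500) NULL
    );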
Then write a stored procedure that validates the data row by row, checking for missing values, text length, and any custom business rules you want to enforce.
Each validation could be a single UPDATE statement against the staging table that sets IsValid (or an ErrorCode) plus an ErrorMessage, with the check expressed in its WHERE clause.
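For instance, assuming the staging table sketched above, a couple of validation passes might look like this (the rules and messages are only examples):

    -- Flag rows where a required value is missing
    UPDATE dbo.ImportStaging
    SET    IsValid      = 0,
           ErrorMessage = 'CustomerName is required'
    WHERE  CustomerName IS NULL OR LTRIM(RTRIM(CustomerName)) = '';

    -- Flag rows whose Amount cannot be converted to a number
    UPDATE dbo.ImportStaging
    SET    IsValid      = 0,
           ErrorMessage = 'Amount is not numeric'
    WHERE  Amount IS NOT NULL AND ISNUMERIC(Amount) = 0;

Note that a later UPDATE will overwrite the ErrorMessage set by an earlier one if a row fails more than one rule; concatenating the messages instead is a simple refinement.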
Once all the validation UPDATE statements have run, you are left with a subset of rows that are valid for importing (those still marked IsValid) and others for which the cause of the problem is clearly identified, so you can log it or inform the user on a row-by-row basis.
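The final step of the stored procedure could then move only the valid rows into the real table and return the rejects, along the lines of the following (the target table and conversions are illustrative):

    -- Copy the rows that passed every validation into the real table
    INSERT INTO dbo.Customers (CustomerName, OrderDate, Amount)
    SELECT CustomerName,
           CONVERT(datetime, OrderDate),
           CONVERT(decimal(10, 2), Amount)
    FROM   dbo.ImportStaging
    WHERE  IsValid = 1;

    -- Report the rejected rows, one per failure, so nothing disappears silently
    SELECT CustomerName, OrderDate, Amount, ErrorMessage
    FROM   dbo.ImportStaging
    WHERE  IsValid = 0;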
That strategy has proved quite effective in scenarios where invalid data is imported frequently.
Now, if you are going to run multiple import processes in parallel, you may need to divide the imported data in the staging table by some "process identifier" (and possibly index that column) so the processes do not block each other while running the mass validations on the data.
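One way to sketch that, again with placeholder names, is to add a process identifier column to the staging table, index it, and have every validation UPDATE filter on it:

    -- Tag each staged row with the import run that loaded it
    ALTER TABLE dbo.ImportStaging ADD ProcessId uniqueidentifier NULL;
    CREATE INDEX IX_ImportStaging_ProcessId ON dbo.ImportStaging (ProcessId);

    -- Each validation then touches only its own batch, e.g.
    -- UPDATE dbo.ImportStaging
    -- SET    IsValid = 0, ErrorMessage = 'CustomerName is required'
    -- WHERE  ProcessId = @ProcessId AND CustomerName IS NULL;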
Upvotes: 1
Reputation: 180858
The SqlBulkCopy class in System.Data.SqlClient should give you more control and feedback.
Upvotes: 0