bjnr

Reputation: 3437

Automate Import of a 10 GB XML file to SQL Server

I get a 10 GB XML file from a vendor every day and want to import the file into SQL Server tables. What would be the right approach?

Upvotes: 0

Views: 416

Answers (1)

David Roussel

Reputation: 5916

That's a big old XML file. Here's how I would do it; on a recent project I was receiving fairly large files for import.

Firstly, I'd make sure I was receiving the file as a zip or gzip file. I'd do this in Java, but it could be done in Python or C#. Then I'd uncompress it in a stream (not the whole file at once, just reading from a compressed stream).
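As a sketch, assuming a gzipped feed, the Java side of that looks something like this (the method name and buffer size are just illustrative):

```java
import java.io.BufferedInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

public class FeedStream {
    // Open the daily feed as a decompressing stream. Nothing is inflated
    // until a consumer reads from it, so the 10 GB file never has to fit
    // in memory.
    static InputStream open(Path gzFile) throws Exception {
        return new GZIPInputStream(
                new BufferedInputStream(Files.newInputStream(gzFile)),
                64 * 1024); // inflater buffer size, tune to taste
    }
}
```

For a plain .zip you'd use java.util.zip.ZipInputStream instead and iterate over its entries.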

Then I'd parse the file with a streaming parser. In Java I'd use StAX, but other languages have other choices. Then, as I read the XML, I'd gather the data and write it out to CSV (tab-separated) files that can be passed to bcp.exe.
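A minimal StAX sketch, assuming a flat layout where each <record> element's children are the fields (both names are hypothetical, and any deeper nesting would need more bookkeeping):

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.InputStream;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

public class XmlToTsv {
    // Walk the document with a pull parser, collecting the text of each
    // child element of a <record> and emitting one tab-separated line per
    // record. Only the current record is ever held in memory.
    static void toTsv(InputStream xml, Writer tsv) throws Exception {
        XMLStreamReader r = XMLInputFactory.newFactory().createXMLStreamReader(xml);
        List<String> fields = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        while (r.hasNext()) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                text.setLength(0);
            } else if (event == XMLStreamConstants.CHARACTERS) {
                text.append(r.getText());
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                if ("record".equals(r.getLocalName())) {
                    tsv.write(String.join("\t", fields));
                    tsv.write('\n');
                    fields.clear();
                } else {
                    // Anything else that closes is treated as a field
                    // (fine for a flat file; the root's closing tag adds
                    // one stray entry that is never written).
                    fields.add(text.toString().trim());
                }
            }
        }
    }
}
```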

I'm not sure of the structure of your data, but maybe you can put it all in one CSV file, or maybe you need multiple types of CSV file. Either way, I'd try not to create CSVs bigger than 50 MB. Once a CSV file has gone past the size threshold, I'd close it, pass it to another thread, and continue XML parsing, as in the sketch below.
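Sketch of that rotation logic, using the 50 MB threshold from above and a hypothetical BcpLoader.load (shown after the next paragraph) as the loader task:

```java
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;

public class RollingTsvWriter {
    static final long MAX_BYTES = 50L * 1024 * 1024; // ~50 MB per chunk

    final ExecutorService loaderPool; // the "second thread(s)" running bcp
    final Path dir;
    Writer out;
    Path current;
    long written;
    int chunkNo;

    RollingTsvWriter(Path dir, ExecutorService loaderPool) {
        this.dir = dir;
        this.loaderPool = loaderPool;
    }

    // Append one TSV row; when the chunk passes the threshold it is
    // closed and handed to the loader pool while parsing continues.
    void writeRow(String row) throws Exception {
        if (out == null) {
            current = dir.resolve("chunk-" + (chunkNo++) + ".tsv");
            out = Files.newBufferedWriter(current);
            written = 0;
        }
        out.write(row);
        out.write('\n');
        written += row.length() + 1; // char count as a cheap proxy for bytes
        if (written >= MAX_BYTES) {
            out.close();
            Path done = current;
            loaderPool.submit(() -> BcpLoader.load(done));
            out = null;
        }
    }

    // Flush whatever is left in the final, undersized chunk.
    void finish() throws Exception {
        if (out != null) {
            out.close();
            Path done = current;
            loaderPool.submit(() -> BcpLoader.load(done));
            out = null;
        }
    }
}
```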

In the second thread I'd then fork out to bcp.exe to load the data.
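And the forked load itself, via ProcessBuilder; the server, database and view names are placeholders for your environment, and the FIRE_TRIGGERS hint matters because bulk loads skip triggers unless asked (relevant to the view trick below):

```java
import java.io.IOException;
import java.nio.file.Path;

public class BcpLoader {
    // Bulk-load one finished chunk with bcp.exe. -c (character mode)
    // defaults to tab field terminators, matching the TSV we wrote.
    static void load(Path tsv) {
        try {
            Process p = new ProcessBuilder(
                    "bcp.exe", "MyDb.dbo.StagingView", "in", tsv.toString(),
                    "-S", "myserver",       // target server (placeholder)
                    "-T",                   // integrated auth
                    "-c",                   // char mode: tab/newline terminators
                    "-h", "FIRE_TRIGGERS")  // let the view's trigger run
                    .inheritIO()            // surface bcp's output and errors
                    .start();
            if (p.waitFor() != 0) {
                throw new IllegalStateException("bcp failed for " + tsv);
            }
        } catch (IOException | InterruptedException e) {
            throw new RuntimeException(e);
        }
    }
}
```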

If you need to load multiple tables, you can still do it via one CSV file: BCP into a view and put an INSTEAD OF INSERT trigger on the view. That trigger can normalise the data, look up primary keys, insert into child tables, and so on.
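A rough T-SQL sketch of that arrangement, with entirely hypothetical table and column names (a customers/orders pair standing in for your parent and child tables):

```sql
-- A wide staging view that bcp can target directly; the INSTEAD OF
-- INSERT trigger fans the rows out to the real normalised tables.
CREATE VIEW dbo.StagingView AS
SELECT c.CustomerName, o.OrderRef, o.Amount
FROM dbo.Customers c
JOIN dbo.Orders o ON o.CustomerId = c.CustomerId;
GO

CREATE TRIGGER dbo.StagingView_Insert ON dbo.StagingView
INSTEAD OF INSERT AS
BEGIN
    SET NOCOUNT ON;

    -- Insert any parent rows we have not seen before.
    INSERT INTO dbo.Customers (CustomerName)
    SELECT DISTINCT i.CustomerName
    FROM inserted i
    WHERE NOT EXISTS (SELECT 1 FROM dbo.Customers c
                      WHERE c.CustomerName = i.CustomerName);

    -- Then insert the child rows, looking up the parent keys.
    INSERT INTO dbo.Orders (CustomerId, OrderRef, Amount)
    SELECT c.CustomerId, i.OrderRef, i.Amount
    FROM inserted i
    JOIN dbo.Customers c ON c.CustomerName = i.CustomerName;
END;
GO
```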

If you are doing this in C#, then maybe you don't need bcp.exe at all, as the native bulk loading (SqlBulkCopy) is better than the Java API.

This overall approach of converting to chunked CSVs, uploading in parallel, and using a trigger to do the lookups worked very well for us.

I had my version take a folder of 6 GB of XML spread over hundreds of files and load it into the DB in a few minutes. And that was into 4 tables, but using one CSV format with a union of all the columns.

Upvotes: 1
