Reputation: 121
What's the most efficient method to load large volumes of data from CSV (3 million+ rows) into a database?
I am leaning towards reading, transforming, and loading the data row by row with a C# application. Is this ideal? If not, what are my options? Should I use multithreading?
Upvotes: 12
Views: 8033
Reputation: 465
If you really want to do it in C#, create and populate a DataTable, truncate the target database table, then use System.Data.SqlClient.SqlBulkCopy.WriteToServer(DataTable dt).
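A minimal sketch of that approach; the table name, column layout, and connection string are placeholders you would swap for your own:

```csharp
using System;
using System.Data;
using System.Data.SqlClient;
using System.IO;

class Loader
{
    static void Main()
    {
        // Build an in-memory DataTable whose columns match the target table.
        var table = new DataTable();
        table.Columns.Add("Id", typeof(int));
        table.Columns.Add("Name", typeof(string));

        foreach (var line in File.ReadLines("data.csv"))
        {
            var fields = line.Split(',');          // naive split; real CSV parsing needs more care
            table.Rows.Add(int.Parse(fields[0]), fields[1]);
        }

        using (var connection = new SqlConnection("Data Source=.;Initial Catalog=MyDb;Integrated Security=true"))
        {
            connection.Open();
            using (var bulkCopy = new SqlBulkCopy(connection))
            {
                bulkCopy.DestinationTableName = "dbo.TargetTable";
                bulkCopy.BatchSize = 10000;        // send rows in batches rather than all at once
                bulkCopy.WriteToServer(table);     // the WriteToServer(DataTable) call mentioned above
            }
        }
    }
}
```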
Upvotes: 0
Reputation: 83
Read the contents of the CSV file line by line into an in-memory DataTable. You can manipulate the data (e.g., split the first name and last name) as the DataTable is being populated, as sketched below.
Once the CSV data has been loaded into memory, use SqlBulkCopy to send the data to the database.
See http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.writetoserver.aspx for the documentation.
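A rough sketch of the transform-while-populating step; the CSV layout ("FullName,Email") and the column names are assumptions for illustration:

```csharp
using System.Data;
using System.IO;

static class CsvTransform
{
    // Assumed layout: each CSV line is "FullName,Email"; the target table
    // has FirstName, LastName and Email columns (names are placeholders).
    public static DataTable LoadContacts(string path)
    {
        var table = new DataTable();
        table.Columns.Add("FirstName", typeof(string));
        table.Columns.Add("LastName", typeof(string));
        table.Columns.Add("Email", typeof(string));

        foreach (var line in File.ReadLines(path))
        {
            var fields = line.Split(',');
            var name = fields[0].Split(' ');       // split "First Last" while populating
            table.Rows.Add(name[0], name.Length > 1 ? name[1] : "", fields[1]);
        }
        return table;
    }
}
```

The returned DataTable can then be handed to SqlBulkCopy.WriteToServer as in the documentation linked above.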
Upvotes: 0
Reputation: 180777
You will be I/O bound, so multithreading will not necessarily make it run any faster.
Last time I did this, it was about a dozen lines of C#. Running in a single thread, it kept the hard disk reading data from the platters as fast as it could, and I read one line at a time from the source file.
If you're not keen on writing it yourself, you could try the FileHelpers library. You might also want to have a look at Sébastien Lorion's work; his CSV reader is written specifically to deal with performance issues.
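The line-at-a-time loop is about as simple as it sounds; a minimal sketch (file name is a placeholder, and the parsing/database work goes in the loop body):

```csharp
using System.IO;

// Read the source file one line at a time; the disk, not the CPU, is the bottleneck.
using (var reader = new StreamReader("source.csv"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // parse/transform the line and hand it off to the database layer here
    }
}
```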
Upvotes: 5
Reputation: 3360
BCP is pretty quick so I'd use that for loading the data. For string manipulation I'd go with a CLR function on SQL once the data is there. Multi-threading won't help in this scenario except to add complexity and hurt performance.
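For reference, a CLR scalar function for that string manipulation is just a static C# method with the SqlFunction attribute; this is only a sketch with a made-up name and logic, deployed afterwards with CREATE ASSEMBLY / CREATE FUNCTION and called in an UPDATE over the bulk-loaded table:

```csharp
using Microsoft.SqlServer.Server;
using System.Data.SqlTypes;

public static class StringFunctions
{
    // Example CLR scalar function: trims and upper-cases a value.
    [SqlFunction(IsDeterministic = true)]
    public static SqlString CleanName(SqlString input)
    {
        if (input.IsNull)
            return SqlString.Null;
        return new SqlString(input.Value.Trim().ToUpperInvariant());
    }
}
```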
Upvotes: 0
Reputation: 53595
You could use csvreader to read the CSV quickly.
Assuming you're using SQL Server, you can use csvreader's CachedCsvReader to read the data into a DataTable, which you can then use with SqlBulkCopy to load into SQL Server.
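One way this can fit together, assuming the LumenWorks CsvReader package (the namespace and constructor arguments here are assumptions): since CachedCsvReader implements IDataReader, it can also be handed straight to SqlBulkCopy without an intermediate DataTable. Table name and connection string are placeholders.

```csharp
using System.Data.SqlClient;
using System.IO;
using LumenWorks.Framework.IO.Csv; // namespace assumed from the LumenWorks CsvReader package

// CachedCsvReader implements IDataReader, so it can be streamed into SqlBulkCopy.
using (var csv = new CachedCsvReader(new StreamReader("data.csv"), true)) // true = file has a header row
using (var connection = new SqlConnection("Data Source=.;Initial Catalog=MyDb;Integrated Security=true"))
{
    connection.Open();
    using (var bulkCopy = new SqlBulkCopy(connection))
    {
        bulkCopy.DestinationTableName = "dbo.TargetTable";
        bulkCopy.WriteToServer(csv);
    }
}
```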
Upvotes: 3
Reputation: 25601
I would agree with your solution. Reading the file one line at a time avoids the overhead of loading the whole file into memory at once, so the application spends its time reading from the file (which is relatively quick) and parsing the lines. The one note of caution I have for you is to watch out for embedded newlines in your CSV: if the specific CSV format you're using can emit newlines inside quoted fields, that would confuse a simple line-by-line algorithm.
Also, I would suggest batching the insert statements (including many insert statements in one string) before sending them to the database, provided this doesn't cause problems retrieving generated key values that you need for subsequent foreign keys (hopefully you don't need to retrieve any generated key values). Keep in mind that SQL Server (if that's what you're using) can only handle 2100 parameters per batch, so limit your batch size to account for that. And I would recommend using parameterized T-SQL statements to perform the inserts. I suspect more time will be spent inserting records than reading them from the file.
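A rough sketch of that batching, with placeholder table and column names; the batch size here is chosen to stay well under both the 2100-parameter limit and SQL Server's 1000-row cap for a single VALUES list:

```csharp
using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Text;

static class BatchInsert
{
    // Sketch only: packs many parameterized rows into one INSERT command.
    public static void Insert(SqlConnection connection, IList<(int Id, string Name)> rows)
    {
        // 500 rows x 2 parameters = 1000 parameters, comfortably under the 2100 limit
        // and under the 1000-row limit for a single VALUES clause.
        const int maxRowsPerBatch = 500;

        for (int start = 0; start < rows.Count; start += maxRowsPerBatch)
        {
            int count = Math.Min(maxRowsPerBatch, rows.Count - start);
            var sql = new StringBuilder("INSERT INTO dbo.TargetTable (Id, Name) VALUES ");

            using (var command = new SqlCommand { Connection = connection })
            {
                for (int i = 0; i < count; i++)
                {
                    if (i > 0) sql.Append(", ");
                    sql.Append($"(@id{i}, @name{i})");
                    command.Parameters.AddWithValue($"@id{i}", rows[start + i].Id);
                    command.Parameters.AddWithValue($"@name{i}", rows[start + i].Name);
                }
                command.CommandText = sql.ToString();
                command.ExecuteNonQuery();
            }
        }
    }
}
```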
Upvotes: 2
Reputation: 136577
You don't state which database you're using, but given the language you mention is C# I'm going to assume SQL Server.
If the data can't be imported using BCP (which it sounds like it can't if it needs significant processing) then SSIS is likely to be the next fastest option. It's not the nicest development platform in the world, but it is extremely fast. Certainly faster than any application you could write yourself in any reasonable timeframe.
Upvotes: 1