c#sql-servermultithreadingparallel-processing

Reputation: 11607

Parallelizing massive inserts in SQL Server from C# (for better time performance)

Problem statement : How to parallelize inserts in SQL Server (2008)

I am performing massive numeric computation for scientific research in C# multithreaded workers that basically do one thing : Test thousands of possible configurations (matrix combinations) through a time period (in days) and store the results into an SQL Server Database.

If i store the results one by one into DB (~300.000 lines per computing session * 100's of sessions), one after the other, I end up waiting for hours for the storing process to end.

The database design is very simple :

Combination Sets
CS_ID1, Value A1, Value B1, Value C1
CS_ID2, Value A2, Value B2, Value C2
.........
Results per Day
CS_ID1, Day1,Result 1
CS_ID1, Day2,Result 2
CS_ID1, Day3,Result 3
.........

.........
CS_ID2, Day1, Result N
CS_ID2, Day2, Result N+1
CS_ID2, Day3, Result N+2

Each "Combination Set" is tested against sample days and its per-day results are processed in a single C# thread, where a LINQ/SQL query is generated and sent to DB just before the end of the thread. Except combination set IDs sequences, there is NO logical relation between Results. This is very important : This is why I thought of parallelizing the insert stuff as it basically amounts to a bulk dump of result blocks

Another detail that could be important is that it is possible to determine beforehand how much rows will be inserted into the Database (per block and in total). This probably could help organize table spaces, split them through pages, pre-fix id ranges in order to store blocks simultaneously, or something like that (No, i'm not "high" or something :-) )

I welcome any kind of suggestions in order to make this insert time as short as possible.

Please take into account that I am a C# developer, with very basic SQL Server knowledge and not very familiar with deep technical DBA concepts (I saw that Locking tweaks are VERY numerous, that there are multithreaded and asynchronous capabilities, too, but I have to admit I am lost alone in the forest :-) )

I Have 12 CPU Cores available, and 24Go RAM

EDIT: Tiebreaker
I welcome any clever suggestion on monitoring time for the whole process : From C# threads inception/end to detailed SQl server insert reports (What happens when, how, and where).
I tried logging whith NLog but it drastically biases the processing time so I am looking for some smart workarounds that are pretty seamless with minimum impact. Same for the SQL server part : I know there are a couple of Logs and monitoring SP's available. I did not figure out yet which ones suit my situation.

Upvotes: 7

Answers (7)

Remus Rusanu

Reputation: 294437

300k inserts is a matter of seconds, at worst minutes, not hours. You must be doing it wrong. The ETL SSIS world record back in 2008 was at 2.36 TB/hour, 300k records is nothing.

The basic rules of thumb are:

batch commit. this is the most important thing. Don't INSERT a row, then INSERT a row, then INSERT a row at nauseam, each insert int its own transaction. Your program has to wait for the log (LDF) to flush after each statement int his case, and will be slow. Very slow. Instead start a transaction, then insert a batch of rows, then commit the transaction:

Pseudocode:

do
  {
  using (TransactionScope scope = new TransactionScope(
     Required, new TransactionOptions() {IsolationLevel = ReadCommitted))
  {
    for (batchsize)
    {
      ExecuteNonQuery ("Insert ...")
    }
    scope.Complete ();
  }
} while (!finished);

if possible, use SqlBulkCopy

The first option alone will get you above 3000 inserts per second (~2 minutes for 300k). Second option should get you into tens of thousands per second range. If you need more, there are more advanced tricks:

use heaps instead of b-trees (no clustered index)
disable secondary indexes
affinitize clients to soft NUMA nodes and push into locked tables per client conenction, then switch them all in using partition switching at the end. This is for Really high end, millions of rows per second.

I suggest you start with the basics of the basics: batch commits.

Upvotes: 10

Jon Onstott

Reputation: 13727

Here's an article on doing bulk insert using C#: Link

Additional thoughts on bulk insert with C# are in a Stack Overflow question: What’s the best way to bulk database inserts from c#?

Hope this helps.

Upvotes: 2

Raymund

Reputation: 7892

Maybe this might help you

I have a step by step guide on how to execute parallel stored procedures in SQL here.

You might be able to combine bulk insert with this one.

Upvotes: 1

Pavel Urbančík

Reputation: 1496

This is an interesting problem. First, how are you using the values in the database? Do they participate in subsequent calculations or is database just "dump" to store the results for later processing? Also is you application/process running 24 hours a day?
Why am I asking - if you could split the "store results" and "process results" operations, you could achieve higher throughput by "blobbing" the data from one session and storing them as one blob. Later, in off-peek time, you could walk and process and "expand" these blobs into tables for example using job or another process. In theory, if this would be OK, you could store these "staging" blobs in binary files, not directly in database, to achieve probably maximum possible write speed (limited only by the file system, OS and underlying disc hardware).

Upvotes: 1

thkala

Reputation: 86433

If you are using a separate transaction for each insert, that would definitely affect performance, as the DB server would have to atomically perform each insert. I have never used SQL server, but most SQL variants have a way to bunch more than one inserts in a single transaction, usually with something like

BEGIN TRANSACTION;

...<various SQL statements>...

COMMIT TRANSACTION;

For the SQL server syntax see:

http://msdn.microsoft.com/en-us/library/ms188929.aspx

http://msdn.microsoft.com/en-us/library/ms190295.aspx

In my experience bundling inserts like this definitely helps with server performance and, to some extent, resource and network usage.

EDIT:

Most (all?) decent DB servers use some sort of per-row locking, rather than per-table locks. You should be able to have multiple concurrent transactions, each with multiple inserts, with no problem - that's what DB servers are designed for. You could certainly have each worker thread perform its own transactions, thus parallelizing the inserts from different threads.

Since you are apparently using a single computer for the computations and the DB, extensively parallelizing DB transactions would not affect performace too much and it might even make it worse, since you don't really have any network latencies to reduce the impact of. As long as all CPU cores are busy, which would probably imply a number of workers >= 12, you should be looking at other optimisations.

If your threads generate their output in one go after processing (e.g. if you compute a large matrix and then dump in to the database) I doubt you would gain anything by storing the result into a file and then having the DB read it back into a table.

If, on the other hand, your threads do their output piece-by-piece you might benefit by storing parts of their output in memory, then inserting those parts in the DB, performing more than one transactions per round. Raising the number of worker threads in that case might allow you to have better CPU utilisation while the DB is storing the data, if the CPU is underutilised.

Storing the worker output in a file should IMHO be avoided since it effectively triples the load on the disk subsystem. The only reason you might want to do that is if you really don't have the memory for intermediate storing of the results.

Upvotes: 5

Gerardo Grignoli

Reputation: 15217

You can try using a Parallel For to do the inserts...

... but I would try BULK INSERT or Batch commit first...

Upvotes: 2

Will Marcouiller

Reputation: 24142

The BULK INSERT might help here.

Upvotes: 5

Parallelizing massive inserts in SQL Server from C# (for better time performance)

Answers (7)

Related Questions