SteveC

Reputation: 16743

Tracking down data load performance issues in SSIS package

Are there any ways to determine what differences between databases affect SSIS package load performance?

I've got a package which loads and does various bits of processing on ~100k records, and on my laptop database it finishes in about 5 minutes.

I tried the same package and same data on the test server, which is a reasonable box in both CPU and memory, and it's still running ... about 1 hour so far :-( I checked the package with a small set of data, and it ran through OK.

Upvotes: 0

Views: 6622

Answers (4)

Mark

Reputation: 9428

I've had similar problems over the past few weeks, and here are several things you could consider, listed in decreasing order of importance according to what made the biggest difference for us:

  1. Don't assume anything about the server. We found that our production server's RAID was misconfigured (HP sold us disks with firmware mismatches) and the disk write speed was literally a fiftieth of what it should have been. So check out the server metrics with Perfmon.

  2. Check that enough RAM is allocated to SQL Server. Inserts of large datasets often require use of RAM and TempDB for building indices, etc. Ensure that SQL has enough RAM that it doesn't need to swap out to Pagefile.sys.

  3. As per the holy grail of SSIS, avoid manipulating large datasets using T-SQL statements. All T-SQL statements cause changed data to be written to the transaction log even if you use the Simple recovery model. The main difference between the Simple and Full recovery models is that Simple automatically truncates the log at each checkpoint. This means that large datasets, when manipulated with T-SQL, thrash the log file, killing performance. (A quick check for points 2 and 3 is sketched just after this list.)

  4. For large datasets, do data sorts at the source if possible. The SSIS Sort component chokes on reasonably large datasets, and the only viable alternative (nSort by Ordinal, Inc.) costs $900 for a non-transferable per-CPU license. So... if you absolutely have to sort a large dataset, consider loading it into a staging database as an intermediate step.

  5. Use the SQL Server Destination if you know your package is going to run on the destination server, since it offers roughly a 15% performance increase over the OLE DB Destination because it shares memory with SQL Server.

  6. Increase the network packet size to 32767 on your database connection managers. This allows large volumes of data to move faster from the source server(s), and can noticeably improve reads on large datasets.

  7. If using Lookup transforms, experiment with cache modes: use a Cache connection or Full Cache mode for smaller lookup datasets, and Partial Cache or No Cache for larger ones. This can free up much-needed RAM.

  8. If combining multiple large datasets, use either RAW files or a staging database to hold your transformed datasets, then combine and insert all of a table's data in a single data flow operation, and lock the destination table. Using staging tables or RAW files can also help relieve table-locking contention.

  9. Last but not least, experiment with the DefaultBufferSize and DefaultBufferMaxRows properties. You'll need to monitor your package's "Buffers Spooled" performance counter using Perfmon.exe, and adjust the buffer sizes upwards until you see buffers being spooled (paged to disk), then back off a little.
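
A quick way to check points 2 and 3 is to query the server directly rather than trusting what anyone tells you about it - just a sketch, and 'MyTargetDB' is a placeholder for your destination database name:

    -- Check the instance memory cap (point 2) and the recovery model (point 3).
    SELECT name, value_in_use
    FROM sys.configurations
    WHERE name = 'max server memory (MB)';

    SELECT name, recovery_model_desc
    FROM sys.databases
    WHERE name = 'MyTargetDB';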

Point 8 is especially important on very large datasets, since you can only achieve a minimally logged bulk insert operation if:

  • The destination table is empty,
  • The table is locked for the duration of the load operation, and
  • The database is in the Simple or Bulk-Logged recovery model.

This means that subsequent bulk loads into the table will always be fully logged, so you want to get as much data as possible into the table on the first data load.
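
To make point 8 concrete, the T-SQL equivalent of the pattern looks roughly like this (a sketch only - MyTargetDB, dbo.Sales and dbo.Staging_Sales are hypothetical names, and in practice the insert would usually be done from the package rather than by hand):

    -- Sketch of a minimally logged first load: empty destination, table lock,
    -- and Simple or Bulk-Logged recovery model.
    ALTER DATABASE MyTargetDB SET RECOVERY BULK_LOGGED;

    INSERT INTO dbo.Sales WITH (TABLOCK)   -- lock the empty destination table
    SELECT *
    FROM dbo.Staging_Sales;                -- all of the staged rows in one operation

    ALTER DATABASE MyTargetDB SET RECOVERY FULL;   -- switch back if you need full recovery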

Finally, if you can partition your destination table and then load the data into each partition in parallel, you can achieve up to 2.5 times faster load times, though this isn't usually a feasible option out in the wild.

Upvotes: 5

SteveC

Reputation: 16743

CozyRoc over at MSDN forums pointed me in the right direction ...
- used the SSMS / Management / Activity Monitor and spotted lots of TRANSACTION entries
- got me thinking; read up on the OLE DB Destination and unchecked the Table Lock option
- WHAM ... data loads fine :-)

Still don't understand why it works fine on my laptop d/b, and stalls on the test server?
- I was the only person using the test d/b, so it's not as if there should have been any contention for the tables ??
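
If anyone hits the same thing, a rough way to see what the load is waiting on while it runs (just a sketch using the standard DMVs, no object names assumed) rather than eyeballing Activity Monitor:

    -- Show active requests, their wait types, and who (if anyone) is blocking them.
    SELECT r.session_id,
           r.status,
           r.wait_type,
           r.blocking_session_id,
           t.text AS running_sql
    FROM sys.dm_exec_requests AS r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE r.session_id <> @@SPID;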

Upvotes: 0

Jeremy Smyth

Reputation: 23493

If you've ruled out network latency, your most likely culprit (with real quantities of data) is your pipeline organisation. Specifically, what transformations you're doing along the pipeline.

Data transformations come in four flavours:

  • streaming (entirely in-process/in-memory)
  • non-blocking (but still using I/O, e.g. Lookup, OLE DB Command)
  • semi-blocking (blocks a pipeline partially, but not entirely, e.g. merge join)
  • blocking (blocks a pipeline until it's entirely received, e.g. sort, aggregate)

If you've a few blocking transforms, that will significantly mash your performance on large datasets. Even semi-blocking, on unbalanced inputs, will block for long periods of time.

Upvotes: 1

Rob

Reputation: 3066

In my experience the biggest performance factor in SSIS is Network Latency. A package running locally on the server itself runs much faster than anything else on the network. Beyond that I can't think of any reasons why the speed would be drastically different. Running SQL Profiler for a few minutes may yield some clues there.
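
If Profiler isn't convenient, the accumulated wait statistics can also hint at whether the time is going to disk, network, or locking - just a rough sketch, and bear in mind the numbers are cumulative since the last restart:

    -- Top waits on the instance, largest first.
    SELECT TOP (10)
           wait_type,
           wait_time_ms,
           waiting_tasks_count
    FROM sys.dm_os_wait_stats
    ORDER BY wait_time_ms DESC;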

Upvotes: 0
