Ben Thul

Reputation: 32697

SSIS incremental load

I'm attempting to transfer approximately 1 billion rows with an SSIS package over an unstable line. As such, it keeps failing part way through. I'd like some way to make it restartable. I attempted to put a lookup transform between the source and destination, but that makes it way too slow. Is there another way to do what I'm trying to do without taking such a performance hit?

Upvotes: 0

Views: 788

Answers (2)

uh_big_mike_boi

Reputation: 3470

Is it worth doing something like this, a million rows at a time?

DECLARE @Counter INT = 1, @ReturnCode INT
DECLARE @Rows INT = 1000000, @Goal INT = 1000000000
DECLARE @Cmd VARCHAR(4000)
  WHILE (@Counter * @Rows <= @Goal)
    BEGIN
      -- xp_cmdshell wants a single string, so build the dtexec call first
      SET @Cmd = 'dtexec /f "C:\path\package.dtsx" /SET \Package.Variables[User::Counter].Value;"'
                 + CONVERT(VARCHAR(10), @Counter) + '"'

      EXEC @ReturnCode = xp_cmdshell @Cmd

      IF (@ReturnCode = 0)
        SET @Counter = @Counter + 1   -- that chunk made it, move on to the next one
      ELSE
        PRINT 'Failed this at this time dude: ' + CONVERT(VARCHAR(30), GETDATE())
    END

And then, inside your OLE DB source component, give your query a WHERE clause like

    WHERE TableID BETWEEN (SELECT MIN(TableID) FROM Table) + ((? - 1) * 1000000)
          AND (SELECT MIN(TableID) FROM Table) + (? * 1000000)

A couple of things, though: I think you will have to build the query in an expression on an SSIS variable (have you ever done that before? It's easy, and it avoids the error you can get when you use a parameter inside a subquery), and you should probably grab the MIN(TableID) just once at the beginning and store it in a variable too.
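For example, the expression on a string variable feeding the OLE DB source might look roughly like this (just a sketch; User::MinTableID is a hypothetical variable holding that MIN(TableID), and User::Counter is the chunk number passed in by dtexec):

    "SELECT * FROM Table WHERE TableID BETWEEN "
    + (DT_WSTR, 20)((DT_I8)@[User::MinTableID] + ((DT_I8)@[User::Counter] - 1) * 1000000)
    + " AND "
    + (DT_WSTR, 20)((DT_I8)@[User::MinTableID] + (DT_I8)@[User::Counter] * 1000000)

Set the OLE DB source's data access mode to "SQL command from variable" and point it at that variable; that sidesteps the parameter-in-a-subquery error entirely, and the DT_I8 casts keep the arithmetic from overflowing once you're near the billionth row.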

Is it all worth this much trouble? Do you think this will help?

Upvotes: 1

billinkc

Reputation: 61211

My initial approach would be to write a package that, when it starts, identifies a subset of the data to be transferred, records which subset it is working on, and attempts to transfer that data. If it completes, it marks that subset as transferred and exits. Otherwise, well, it's already blown up and there's nothing more for that package to do.
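
A minimal sketch of that bookkeeping, assuming a hypothetical dbo.TransferBatch table that carves the key range into chunks (the table and column names here are mine, not anything built in):

    -- Hypothetical tracking table: one row per range of keys to move
    CREATE TABLE dbo.TransferBatch
    (
        BatchID    INT IDENTITY(1, 1) PRIMARY KEY,
        RangeStart BIGINT NOT NULL,
        RangeEnd   BIGINT NOT NULL,
        Status     VARCHAR(20) NOT NULL DEFAULT 'Pending',  -- Pending / InProcess / Done
        ClaimedAt  DATETIME NULL
    );

    -- At package start: claim one pending batch and remember which one we grabbed
    DECLARE @BatchID INT;

    UPDATE TOP (1) dbo.TransferBatch
    SET Status = 'InProcess',
        ClaimedAt = GETDATE(),
        @BatchID = BatchID
    WHERE Status = 'Pending';

    -- After the data flow finishes cleanly: mark that batch done
    UPDATE dbo.TransferBatch
    SET Status = 'Done'
    WHERE BatchID = @BatchID;

If you later run several copies in parallel, add READPAST/UPDLOCK hints (or an OUTPUT clause) to the claim so two packages can't grab the same batch.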

Another process would run every X timeframe and attempt to find failed transfers (subsets marked as in process but older than Y duration). It would then delete those rows from the transfer table, or mark them as eligible for transfer again. The general idea is that anything that broke gets flagged for a do-over.
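
Against that same hypothetical table, the sweeper can be little more than:

    -- Clear any half-loaded rows for batches that stalled more than 2 hours ago
    DELETE d
    FROM dbo.DestinationTable AS d             -- hypothetical destination table name
    JOIN dbo.TransferBatch AS b
        ON d.TableID BETWEEN b.RangeStart AND b.RangeEnd
    WHERE b.Status = 'InProcess'
      AND b.ClaimedAt < DATEADD(HOUR, -2, GETDATE());

    -- ...then put those batches back in the queue for a do-over
    UPDATE dbo.TransferBatch
    SET Status = 'Pending',
        ClaimedAt = NULL
    WHERE Status = 'InProcess'
      AND ClaimedAt < DATEADD(HOUR, -2, GETDATE());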

It's a pretty simple design; the hardest part is segmenting the data and keeping track of what has and hasn't been transferred. Then simply set up a SQL Agent job to fire the package every N timeframes. If your network weren't the faulty part, a nice thing about doing a data transfer like this is that it lets you parallelize your executions for maximum throughput. The SSIS team took a similar approach when they set the record for loading data. They also goosed the hell out of the system, but that's to be expected.
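
If it helps, the agent piece is just a job with a CmdExec step calling dtexec plus a short schedule; a rough sketch (job name, schedule, and package path are all placeholders):

    USE msdb;
    GO

    EXEC dbo.sp_add_job @job_name = N'Transfer next batch';

    -- One step: shell out to dtexec for the restartable package
    EXEC dbo.sp_add_jobstep
        @job_name  = N'Transfer next batch',
        @step_name = N'Run package',
        @subsystem = N'CmdExec',
        @command   = N'dtexec /f "C:\path\package.dtsx"';

    -- Fire it every 15 minutes, all day
    EXEC dbo.sp_add_jobschedule
        @job_name             = N'Transfer next batch',
        @name                 = N'Every 15 minutes',
        @freq_type            = 4,   -- daily
        @freq_interval        = 1,
        @freq_subday_type     = 4,   -- subday unit = minutes
        @freq_subday_interval = 15;

    EXEC dbo.sp_add_jobserver @job_name = N'Transfer next batch';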

If the 1B-row system has lots of updates going on, then your data segmenting logic will need some mechanism to identify changes and ensure those records are fed back into the system, but since you didn't specify that as a need, I'll ignore it for now.

If you are trying to use a lookup, ensure you are only pulling back what is absolutely essential; for this case, I would hope it's a very narrow key. If you're taking the segmented data load approach, then make sure you use the same partitioning logic in your lookup transformation. Otherwise, you'll pull back roughly (1B - transferSize) rows of data on your final run, which will undoubtedly play havoc with your shoddy network.
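
As a contrived example of "narrow and partitioned": the lookup's query for one 1,000,000-row chunk would be just the key over just that slice, with the bounds built the same counter-based way as the source query (the table name and literal bounds here are stand-ins for one example chunk):

    -- Only the key column, only the slice this run can touch
    SELECT TableID
    FROM dbo.DestinationTable
    WHERE TableID BETWEEN 2000001 AND 3000000;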

Lots of generalities but sing out if you want more details on a facet.

Upvotes: 2
