user3104183

Reputation: 408

SSIS upserting 10^8 rows - process by batch?

I have to gather a large volume of data from various SQL Server tables (around 300 million rows) and upsert it into a single fact table in my data warehouse.

1/ What is the best strategy to import all these rows?

2/ Is it good practice to import in batches? How big should a batch be? Is 10k rows OK?

Upvotes: 1

Views: 1234

Answers (1)

rvphx

Reputation: 2402

The way I designed this was as a data movement between 3 different layers:

  1. Landing Area
  2. Staging area (where most of the lookups and key substitutions happened)
  3. Data Warehouse

We created bulk tables in the landing area without any sort of keys or constraints on them. We would simply land the data in that area and then move it further along the system.
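To give an idea of what those landing tables looked like, something along these lines would do (the table and column names below are only placeholders, not my actual schema):

    -- Illustrative landing table: no keys, no indexes, just a heap
    -- to bulk-load into before the staging layer picks it up.
    CREATE TABLE dbo.LandingSalesRaw
    (
        SourceSystem    VARCHAR(50),
        TransactionDate DATETIME,
        Amount          DECIMAL(18, 2),
        RawPayload      VARCHAR(MAX)
    );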

The way I designed the package was to create 2 very simple tables in SQL Server with 4 columns each. The first table I called ToBeProcessed and the 2nd (quite obviously) Processed. The columns I used were:

1) dbo.ToBeProcessed

    CREATE TABLE dbo.ToBeProcessed
    (
        ID        INT IDENTITY(1,1),
        BeginDate DATETIME,
        EndDate   DATETIME,
        Processed VARCHAR(1)
    );

2) dbo.Processed

    CREATE TABLE dbo.Processed
    (
        ID                INT IDENTITY(1,1),
        ProcessedEndDate  DATETIME,
        TableName         VARCHAR(24),
        CompletedDateTime DATETIME
    );

What I did was populate the ToBeProcessed table with date ranges spanning a week each. For example, the first row would be from 01/01/2014 to 01/07/2014, the next row from 01/08/2014 to 01/14/2014, and so on. This makes sure that you don't overlap any piece of data that you are pulling in.
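If it helps, here is a rough sketch of how that table could be seeded with weekly ranges (the 2014 boundaries and the 'N' flag value are just placeholders for your own load window):

    -- Seed dbo.ToBeProcessed with non-overlapping 7-day slices.
    DECLARE @RangeStart DATETIME = '2014-01-01';
    DECLARE @LoadEnd    DATETIME = '2014-12-31';

    WHILE @RangeStart <= @LoadEnd
    BEGIN
        INSERT INTO dbo.ToBeProcessed (BeginDate, EndDate, Processed)
        VALUES (@RangeStart, DATEADD(DAY, 6, @RangeStart), 'N');

        -- Next slice starts the day after the previous EndDate.
        SET @RangeStart = DATEADD(DAY, 7, @RangeStart);
    END;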

On the SSIS side you would want to create a Foreach Loop container and work through all the dates in the 1st table one by one. You can parameterize your Data Flow task with the variables you create to hold the dates from the Foreach Loop container. Every time a week's worth of data gets processed, you simply insert the end date into your 2nd table.
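As a rough illustration, the source query inside the Data Flow and the logging insert afterwards could look something like this (the source table, column names and variable names are only examples of the mapping, not something you have to copy):

    -- OLE DB source query; map the ? parameters to the SSIS variables
    -- holding the current slice (e.g. User::BeginDate, User::EndDate).
    SELECT *
    FROM dbo.SourceFactRows
    WHERE TransactionDate >= ?   -- User::BeginDate
      AND TransactionDate <= ?;  -- User::EndDate

    -- Execute SQL task after the Data Flow succeeds: log the slice as done.
    INSERT INTO dbo.Processed (ProcessedEndDate, TableName, CompletedDateTime)
    VALUES (?, 'SourceFactRows', GETDATE());  -- ? = User::EndDate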

This makes sure that you keep track of the data you have processed. The reason for doing this is that if the package fails for any reason, you can restart from the point of failure without re-pulling all the data you have already processed (in your case, I think you may want to turn the T-logs off if you are not working in a production environment).
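To restart cleanly, the Foreach loop can be fed only the ranges that have not been logged in the 2nd table yet; something along these lines (again, just a sketch against the two tables above):

    -- Return only the slices whose EndDate has not been logged as processed.
    SELECT tbp.BeginDate, tbp.EndDate
    FROM dbo.ToBeProcessed AS tbp
    WHERE NOT EXISTS
    (
        SELECT 1
        FROM dbo.Processed AS p
        WHERE p.ProcessedEndDate = tbp.EndDate
    )
    ORDER BY tbp.BeginDate;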

As for the upserting, I think a MERGE statement could be an option, but it all depends on what your processing time frames are. If you are looking to turn this around over a weekend, I would suggest using a stored procedure on the data set and making sure that your log tables can grow comfortably with that amount of data.
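A bare-bones MERGE for the upsert could look like the following; the fact and staging table names and the key/measure columns are only stand-ins for your own schema:

    -- Upsert one processed slice from staging into the fact table.
    MERGE dbo.FactSales AS tgt
    USING dbo.StagingSales AS src
        ON tgt.BusinessKey = src.BusinessKey
    WHEN MATCHED THEN
        UPDATE SET tgt.Amount = src.Amount
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (BusinessKey, Amount)
        VALUES (src.BusinessKey, src.Amount);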

This is a brief summary of the quick-and-dirty approach that worked for me. That does not mean it's the best method out there, but it certainly got the job done for me. Let me know if you have any questions.

Upvotes: 1
