TH58PZ700U

Reputation: 43

SSIS Lookup Transformation Stack

I have a data flow task within SSIS 2008 that pulls about 2,000,000 rows with an OLE DB source, then runs them one-by-one through 24 lookup transformations, replacing "code" data with either its pre-defined equivalent from a dimension table or an "unknown" value.

This one-by-one process, with the entire flow running through each transformation in turn, has become a major bottleneck in the execution of the package, and I need a way to speed it up. Any ideas?

I've tried multicasting the data set to each of the 24 lookups (so that only the necessary column is sent to each), but when I then run them all into a Union All, that task seems not to like the various data types and tends to throw errors no matter how I configure it. Is there another option I'm missing?

Upvotes: 0

Views: 7200

Answers (2)

user756519

The bottleneck may not be where you think it is. It could be the destination component that is slowing the package down: the transformations wait until each batch of data has been inserted into the destination, which makes the transformations that sit in yellow look like the slow ones. In my experience, lookup transformations themselves are actually very fast.

The following example reads 1 million rows from a flat file source and inserts them into SQL Server. Although it uses only one lookup, I am providing it here to illustrate the value of multiple destination components: having several destinations accept the data processed by the various transformations will speed up the package.

I hope this example gives you an idea about how you can improve your package performance.

Step-by-step process:

  1. In the SQL Server database, create two tables named dbo.ItemInfo and dbo.Staging. The CREATE TABLE queries are available under the Scripts section, and the structure of these tables is shown in screenshot #1. ItemInfo holds the actual data; Staging holds the staging data used to compare against and update the actual records. The Id column in both tables is an auto-generated identity column. The IsProcessed column in ItemInfo is used to identify and delete records that are no longer valid.

  2. Create an SSIS package and create 5 variables as shown in screenshot #2. I have used the .txt extension for the tab-delimited files, hence the value *.txt in the FileExtension variable. The FilePath variable is assigned its value at run-time. The FolderLocation variable denotes where the files are located. The SQLPreLoad and SQLPostLoad variables denote the stored procedures used during the pre-load and post-load operations; scripts for these stored procedures are provided under the Scripts section.

  3. Create an OLE DB connection pointing to the SQL Server database. Create a flat file connection as shown in screenshots #3 and #4. The Flat File Connection Columns section contains the column-level information, and screenshot #5 shows a preview of the column data.

  4. Configure the control flow as shown in screenshot #6, and configure the tasks Pre Load, Post Load and Loop Files as shown in screenshots #7 - #10. Pre Load truncates the staging table and sets the IsProcessed flag to false for all rows in ItemInfo. Post Load applies the changes and deletes rows from the database that were not found in the file. Refer to the stored procedures used in those tasks to understand what these Execute SQL tasks do.

  5. Double-click the Load Items data flow task and configure it as shown in screenshot #11. Read File is a flat file source configured to use the flat file connection. Row Count is a derived column transformation and its configuration is shown in screenshot #12. Check Exist is a lookup transformation and its configurations are shown in screenshots #13 - #15. The Lookup No Match Output is redirected to Destination Split on the left side, and the Lookup Match Output is redirected to Staging Split on the right side (the SQL sketch after this list shows the logic this split implements). Destination Split and Staging Split have exactly the same configuration, shown in screenshot #16. The reason for 9 different destinations for both the destination and staging tables is to improve the performance of the package.

  6. All the destination tasks 0 - 8 are configured to insert data into table dbo.ItemInfo as shown in screenshot #17. All the staging tasks 0 - 8 are configured to insert data into dbo.Staging as shown in screenshot #18.

  7. On the flat file connection manager, set a property expression on the ConnectionString property so that it uses the FilePath variable (e.g. @[User::FilePath]), as shown in screenshot #19. This enables the package to use the value set in the variable as it loops through each file in the folder.
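
For those who think in set-based SQL, the Check Exist lookup in step 5 splits the file rows the same way the following queries would. This is only a sketch of the logic, not something SSIS executes; FlatFileRows is a hypothetical stand-in for the rows read from the flat file.

-- No Match output: ItemIds not yet present in dbo.ItemInfo
-- flow to the destination tasks and are inserted directly.
SELECT F.ItemId, F.ItemName, F.ItemType
FROM   FlatFileRows AS F
WHERE  NOT EXISTS (SELECT 1 FROM dbo.ItemInfo AS I WHERE I.ItemId = F.ItemId);

-- Match output: ItemIds that already exist flow to the staging
-- tasks, feeding the UPDATE in the PostLoad procedure.
SELECT F.ItemId, F.ItemName, F.ItemType
FROM   FlatFileRows AS F
WHERE  EXISTS (SELECT 1 FROM dbo.ItemInfo AS I WHERE I.ItemId = F.ItemId);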

Test scenarios:

Test results may vary from machine to machine. In this scenario, the file was located locally on the machine; files on a network share might perform slower. These numbers are provided just to give you an idea, so please take them with a grain of salt.
  1. The package was executed on a 64-bit machine with a single-core 2.5 GHz Xeon CPU and 3.00 GB of RAM.

  2. Loaded a flat file with 1 million rows. The package executed in about 2 minutes 47 seconds. Refer to screenshots #20 and #21.

  3. Used the queries provided under the Test queries section to modify the data, simulating the update, deletion and creation of records for the second run of the package.

  4. Loaded the same file containing the 1 million rows after those queries had been executed against the database. The package executed in about 1 minute 35 seconds. Refer to screenshots #22 and #23. Note the number of rows redirected to the destination and staging tables in screenshot #22.

Hope that helps.

Test queries:

--These records will be deleted during next run 
--because item ids won't match with file data.
--(111111 row(s) affected)
UPDATE dbo.ItemInfo SET ItemId = 'DEL_' + ItemId WHERE Id % 9 IN (3)

--These records' item type will be changed back to its original value of 'General'
--during the next run, because that is the data present in the file.
--(222222 row(s) affected)
UPDATE dbo.ItemInfo SET ItemType = 'Testing' + ItemId WHERE Id % 9 IN (2,6)

--These records will be reloaded into the table from the file.
--(111111 row(s) affected)
DELETE FROM dbo.ItemInfo WHERE Id % 9 = 5

Flat File Connection Columns:

Name        InputColumnWidth     DataType          OutputColumnWidth
----------  ----------------     ---------------   -----------------
Id          8                    string [DT_STR]   8
ItemId      11                   string [DT_STR]   11
ItemName    21                   string [DT_STR]   21
ItemType    9                    string [DT_STR]   9

Scripts (to create both tables and stored procedures):

CREATE TABLE [dbo].[ItemInfo](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [ItemId] [varchar](255) NOT NULL,
    [ItemName] [varchar](255) NOT NULL,
    [ItemType] [varchar](255) NOT NULL,
    [IsProcessed] [bit] NULL,
    CONSTRAINT [PK_ItemInfo] PRIMARY KEY CLUSTERED ([Id] ASC),
    CONSTRAINT [UK_ItemInfo_ItemId] UNIQUE NONCLUSTERED ([ItemId] ASC)) ON [PRIMARY]
GO

CREATE TABLE [dbo].[Staging](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [ItemId] [varchar](255) NOT NULL,
    [ItemName] [varchar](255) NOT NULL,
    [ItemType] [varchar](255) NOT NULL,
 CONSTRAINT [PK_Staging] PRIMARY KEY CLUSTERED ([Id] ASC)) ON [PRIMARY]
GO

CREATE PROCEDURE [dbo].[PostLoad]
AS
BEGIN
    SET NOCOUNT ON;

    -- Pull updated values from staging into the main table and flag
    -- each matched row as still present in the latest file.
    UPDATE      ITM
    SET         ITM.ItemName    = STG.ItemName
            ,   ITM.ItemType    = STG.ItemType
            ,   ITM.IsProcessed = 1
    FROM        dbo.ItemInfo    ITM
    INNER JOIN  dbo.Staging     STG
    ON          ITM.ItemId      = STG.ItemId;

    -- Rows still flagged 0 existed before the load but were not in
    -- the file, so they are no longer valid and are removed.
    DELETE FROM dbo.ItemInfo
    WHERE       IsProcessed = 0;
END
GO

CREATE PROCEDURE [dbo].[PreLoad]
AS
BEGIN
    SET NOCOUNT ON;

    -- Clear the staging data left over from the previous run.
    TRUNCATE TABLE dbo.Staging;

    -- Reset the flag on every existing row; rows still 0 after the
    -- load will be deleted by the PostLoad procedure.
    UPDATE  dbo.ItemInfo
    SET     IsProcessed = 0;
END
GO

Screenshots #1 - #23: [images not included]

Upvotes: 0

Pondlife

Reputation: 16240

I would do it all in pure TSQL: insert the 2 million rows into a staging table and use UPDATE statements to set the values you need. That will almost certainly be much faster than a row-by-row lookup process, and you can also put indexes on the staging table if necessary.
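
A minimal sketch of that approach, using hypothetical names (dbo.StagingFact stands in for the 2 million staged rows, dbo.DimStatus for one of the 24 dimension tables, and StatusCode/StatusDesc for a code and its replacement value):

-- Index the code column so the join against the dimension is cheap.
CREATE INDEX IX_StagingFact_StatusCode ON dbo.StagingFact (StatusCode);

-- One set-based UPDATE replaces one lookup transformation; repeat per code column.
-- The LEFT JOIN yields NULL for codes with no dimension match,
-- which COALESCE turns into the 'Unknown' fallback.
UPDATE      STG
SET         STG.StatusDesc = COALESCE(DIM.StatusDesc, 'Unknown')
FROM        dbo.StagingFact AS STG
LEFT JOIN   dbo.DimStatus   AS DIM
ON          DIM.StatusCode  = STG.StatusCode;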

After the data is updated, you can push it on to the destination table in another data flow or, if the staging and destination tables are on the same server, just use INSERT ... SELECT ... to do it.
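
With the same hypothetical names (dbo.FactTable standing in for the destination), the same-server hand-off is a single statement:

-- List the real columns explicitly rather than using SELECT *.
INSERT INTO dbo.FactTable (StatusCode, StatusDesc)
SELECT      StatusCode, StatusDesc
FROM        dbo.StagingFact;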

Personally, I always avoid SSIS transformations if there's an easy way to do it in TSQL; the performance is better and I find TSQL code easier to maintain than SSIS packages. Having said that, SSIS is a great control-flow tool for getting data from different places, delivering it to a staging database where you can work on it, and executing the procedures or scripts that transform the data.

Upvotes: 2
