Andrew Keller
Andrew Keller

Reputation: 3239

Restrict Lookup Transformation Partial Cache to Loaded Values

I am using SSIS to load data from flat files to a SQL table. The flat files contain both new and updated rows. Each time the process is run, the updated rows will affect a small subset of the SQL table, specified by a 'period' column (e.g. one procedure may only affect periods 3, 4, and 5).

I am using a Look-Up transformation to separate new rows (Lookup No Match Output) from existing row s(Lookup Match Output). Since both the reference set and the data set being loaded is extremely large, I would like to use partial caching for the lookup. Is it somehow possible to modify the partial caching query to only include rows from the period numbers included in the flat files?

For example, my reference table may contain data from periods 1-10, but my flat files being loaded may only have data from periods 3-5. Therefore, I only want to cache data from periods 3-5, since I already know periods 1-2 and 6-10 will never produce a match.

Upvotes: 2

Views: 2911

Answers (2)

billinkc
billinkc

Reputation: 61211

Instead of using the table selector in the drop down, which you should never do unless you need every column from every row, write your query to only pull back the columns you need for either matching or augmenting the existing data. In your case, you're going to need to add a filter which is a bit persnickety.

The best approach I've found is to write the lookup query in a variable of type String. In it, I will build the query and apply the needed filter. Below, you see I defined two variables. One an int which will serve as my filter and then the query itself which uses it.

enter image description here

The expression on my SourceQuery Variable is

"SELECT 
    D.rn
FROM
(
SELECT TOP 10
    ROW_NUMBER() OVER (ORDER BY (SELECT NULL))  * 2 AS rn
FROM
    sys.all_columns AS SA
) AS D(rn) 
WHERE D.rn <= " + (DT_WSTR, 10) @[User::MaxID]

My Data Flow looks like

enter image description here

I have my source and it hits a lookup and based on matched results it goes to one of the two buckets. My source query just generates the numbers 1 to 10 and the lookup is a query that generates even numbers from 2 to 20.

During design time, that query looks like

SELECT 
    D.rn
FROM
(
SELECT TOP 10
    ROW_NUMBER() OVER (ORDER BY (SELECT NULL))  * 2 AS rn
FROM
    sys.all_columns AS SA
) AS D(rn) 

A normal run would result in a 50/50 split between the buckets

enter image description here

The goal of course is to make the lookup query take a parameter like one of the source components but you'd quickly discover that

SELECT 
    D.rn
FROM
(
SELECT TOP 10
    ROW_NUMBER() OVER (ORDER BY (SELECT NULL))  * 2 AS rn
FROM
    sys.all_columns AS SA
) AS D(rn) 
WHERE D.rn > ?

doesn't fly. Instead, you have to go back out to the Control Flow and select the Data Flow, Right click and select Properties. In the window for your data flow, go to Expressions and click the ellipses (...)

enter image description here

There will be a property named as your Lookup task. Assign the Variable that uses the expression to make it all dynamic and voila, with a maxid of 6 I only find 3 matches

enter image description here

A final note, the Partial Cache may or may not be what you're looking for. That's an actual lookup setting that controls how it balances the cost of lookup data versus caching it locally. A full cache will drop all the specified columns for the applicable range into memory which is why you only want to specify the columns you need. If you can get it down to a few skinny columns and even if it's millions of rows, you probably aren't going to feel pain.

Upvotes: 2

Kyle Hale
Kyle Hale

Reputation: 8120

Contrived example:

  1. Glean your period minum and maximum at runtime and store them in two variables, PeriodMinimum and PeriodMaximum (I'm assuming it's a range, I'll discuss alternatives at the end.)
  2. Add them as Derived Columns to your Source flow.
  3. In the Lookup Editor, under the Advanced tab, use a custom query (Contrived example): SELECT lookup, value FROM reference where period between ? and ?
  4. Click the Parameters button and use your input columns appropriately.

If instead of a range you want to be able to randomly choose periods (3, 6, and 10) you'll have to do something a bit more contrived ...

  1. Create multiple variables, Period1, 2, 3 ... n and set the default to -1 or some value which is not a valid period.
  2. Populate these variables as needed with the periods you do want to filter on.
  3. In the custom query, use SELECT lookup, value FROM reference where period = ? or period = ? or period = ?, ...
  4. Set each parameter using your input columns.

Anyway, in general, use the Custom query with parameters when you want a dynamic lookup query based on runtime data.

Upvotes: 1

Related Questions