Len

Reputation: 43

Lookup transformation in SSIS bug: doesn't prevent insertion of already existing record

I have an SSIS package that reads from a source and performs a Lookup transformation to check whether each record exists in the destination. If it exists, the record is redirected to the match output and updated; otherwise, it goes to the no-match output and is inserted. The problem is that sometimes it inserts a record that should have been redirected for update. This runs via a job; if I execute the package manually, everything is fine. The Lookup component is set up correctly with the matching column.

I can't figure out why this happens, and the silliest thing is that I can't debug it, because manually everything is fine.

Any ideas?

Upvotes: 1

Views: 1070

Answers (1)

billinkc

Reputation: 61211

Two scenarios come to mind where you'd get inserts that should have been updates.

Duplicate source values

The first is that you have duplicate keys in your source data and nothing in the target table.

Source data

Key|Value
A  |abc
B  |bcd
A  |cde

Destination data

Key|Value
C  |yza
B  |zab

In this situation, assuming the default behaviour of the Lookup component, Full Cache, SSIS will run the source query for the Lookup reference table before the data flow begins execution. Only once all the lookup table data has been cached will the data flow begin flowing data.

  • The first row, A:abc hits the lookup. Nope, no data and off to the Insert path.
  • The second row B:bcd hits the lookup. Nope, no data and off to the Insert path.
  • The third row A:cde hits the lookup. Nope, no data and off to the Insert path (and hopefully a primary/unique key violation)

When the package started, it only knew about data in the destination table. During the run you added the same key value to the table but never asked the lookup component to check for updates.
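The sequence above can be simulated with a short Python sketch (the data and variable names are mine, not part of SSIS): the reference data is snapshotted once before any rows flow, so the duplicate A key misses the cache both times.

```python
# Rough analogy of Full Cache behavior: the reference (destination) table
# is snapshotted once, before the data flow starts.
source = [("A", "abc"), ("B", "bcd"), ("A", "cde")]
destination = {"C": "yza", "B": "zab"}

cache = dict(destination)  # snapshot taken before any rows flow

inserts, updates = [], []
for key, value in source:
    if key in cache:            # match output -> update path
        updates.append((key, value))
    else:                       # no-match output -> insert path
        inserts.append((key, value))
        destination[key] = value  # the row lands in the table...
        # ...but the cache is never refreshed, so the second "A" still misses

print(inserts)  # both A rows take the insert path
print(updates)
```

Both A rows end up on the insert path even though the first one is already in the destination by the time the second arrives, which is exactly the symptom described in the question.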

In this situation, there are two resolutions. The first is to change the cache mode from Full to None (or Partial). This will have the Lookup component issue a query against the target table for every row that flows through the data flow. That can get expensive for large row volumes. It also won't be foolproof, because the data flow has the concept of buffers, and in a situation like our sample 3-row load, all the rows would fit in one buffer. All the rows in the buffer hit the Lookup at approximately the same time, and thus the target table will still not contain an A value when the third row flows through the component. You can put the brakes on the data flow and force it to process one row at a time by adjusting the buffer size to 1, but that's generally not going to be a good solution.

The other resolution would be to dedupe/handle survivorship. Which A row should make it to the database in the event our source has different values for the same business key? First, last, pick one? If you can't eliminate the duplicates before they hit the Data Flow, then you'll need to deduplicate the data, using an Aggregate component to roll up your data as best you can.
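As a minimal sketch of the survivorship decision, assuming a "last one wins" rule (the data is the sample from above; in SSIS you'd express this with an Aggregate component or in the source query):

```python
# Hypothetical last-row-wins deduplication, applied before rows reach the Lookup.
source = [("A", "abc"), ("B", "bcd"), ("A", "cde")]

survivors = {}
for key, value in source:
    survivors[key] = value   # later rows overwrite earlier ones: "last" wins

deduped = list(survivors.items())
print(deduped)  # one row per business key
```

With "first one wins" you'd instead skip keys already seen; either way, each business key reaches the Lookup exactly once.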

Case sensitive lookups

Source data

Key|Value
A  |abc
B  |bcd
a  |cde

Destination data

Key|Value
C  |yza
B  |zab

The other scenario where the Lookup component bites you is that under the default, Full Cache, mode, matching is based on .NET equality rules for strings. Thus AAA is not equal to AaA. If your lookup is doing string matching, then even if your database is case insensitive, the SSIS Lookup will still be case sensitive.

In situations where I need to match alpha data, I usually make an extra/duplicate column in my source data which is the key data all in upper or lower case. If I am querying the data, I add it to my query. If I am loading from a flat file, then I use a Derived Column Component to add my column to the data flow.

I then ensure the data in my reference table is equally cased when I use the Lookup component.
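To see why the normalization matters, here's a hypothetical Python illustration (the keys are made up) contrasting raw .NET-style case-sensitive matching with matching after upper-casing both sides:

```python
# A case-insensitive database would treat "a" and "A" as the same key,
# but the Full Cache lookup compares strings exactly.
destination_keys = {"A", "B"}

raw_hit = "a" in destination_keys                 # miss: "a" != "A"

# equivalent of adding an upper-cased key column to both sides
normalized_keys = {k.upper() for k in destination_keys}
normalized_hit = "a".upper() in normalized_keys   # hit after normalizing

print(raw_hit, normalized_hit)
```

Without the normalization, the lowercase `a` row takes the insert path and you get the duplicate described above; with it, the row matches and is routed to the update path.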

Lookup Component caveats

Full Cache mode:

  • insensitive to changes to the reference data
  • case-sensitive matches
  • generally faster overall
  • delays the data flow until the lookup data has been cached
  • NULL matches empty string
  • cached data requires RAM

No Cache mode:

  • sensitive to changes in the reference data
  • case sensitivity is based on the rules of the lookup system (if the DB is case sensitive, you're getting a sensitive match)
  • performance: it depends (100 rows of source data, 1000 rows of reference data: no one will notice. 1B rows of source data and 10B rows of reference data: someone will notice. Are there indexes to support the lookups, etc.)
  • NULL matches nothing
  • no appreciable SSIS memory overhead

Partial Cache: The partial cache is mostly like the No cache option except that once it gets a match against the reference table, it will cache that value until execution is over or until it gets pushed out due to memory pressure
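The partial-cache behaviour can be sketched as a memoized lookup (a rough Python analogy of the idea, not the actual SSIS implementation; `reference_table` stands in for the database):

```python
# Partial Cache analogy: a miss queries the reference table; a match is
# remembered for the rest of the execution.
reference_table = {"B": "zab", "C": "yza"}
cache = {}
queries = 0

def lookup(key):
    global queries
    if key in cache:
        return cache[key]            # served from cache, no query
    queries += 1                     # only a cache miss hits the database
    result = reference_table.get(key)
    if result is not None:
        cache[key] = result          # only matches are remembered
    return result

for key in ["B", "B", "C", "B"]:
    lookup(key)

print(queries)  # one query each for B and C; the repeats hit the cache
```

This is why Partial Cache sits between the other two modes: it still sees mid-run changes on a first-time key, but repeated keys pay no further query cost.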

Lookup Cache NULL answer

Upvotes: 3
