tad
tad

Reputation: 33

Do SSIS lookup transformations cache only relevant columns?

I have a full cache lookup that seems to be causing a failure. It was pointing directly to the lookup table via the drop down list in the Lookup configuration, which my understanding is equivalent to a SELECT *. I thought that by changing the lookup configuration to use results from a SQL statement, in which I selected only the lookup value and the return value, that it would reduce the cache size. However, I'm seeing no decrease in cache size after this change was made. Is this because the Lookup is smart enough to cache only the columns specified in the columns section of the configuration, as opposed to the entire table?

Upvotes: 1

Views: 205

Answers (1)

billinkc
billinkc

Reputation: 61211

You are correct in that selecting the table in the drop down is equivalent to a SELECT * against the table and thus the lookup will cache all the data, regardless of column usage.

For performance, I always advise people to write a custom query pulling on the columns required and where appropriate, filter the depth of the rows retrieved.

However, your question got me to actually looking. These results are based on SQL Server 2017

I build out 4 packages. The basic pattern uses Adventureworks2014DW's FactInternetSales as the driver (OLE DB Source) against a Lookup of DimCustomer matching on CustomerKey

Operation 5 - Default table selection. No columns retrieved - just linked

Operation 6 - Custom Query for Connection SELECT DC.CustomerKey FROM dbo.DimCustomer AS DC

Operation 7 - Custom Query for Connection with a filter SELECT DC.CustomerKey FROM dbo.DimCustomer AS DC WHERE EXISTS (SELECT * FROM [dbo].[FactInternetSales] AS FIS WHERE FIS.CustomerKey = DC.CustomerKey )

Operation 9 - Duplicated the first package and this time, selected all the attributes from the Customer dimension, less the key. These are the results

I deployed the Integration Services catalog and run under Verbose logging. I used a query like this to retrieve information metrics

SELECT
    OM.operation_id
,   OM.message
FROM
    SSISDB.catalog.operation_messages AS OM
WHERE
    OM.message like 'Data flow task:inform%lookup%'
    AND OM.operation_id > 4
ORDER BY
    OM.operation_id
,   OM.message_time;

The following are my results.

operation_id    message
5   Data Flow Task:Information: Lookup has cached 8192 rows.  
5   Data Flow Task:Information: Lookup has cached a total of 18484 rows.  
5   Data Flow Task:Information: The Lookup processed 18484 rows in the cache. The processing time was 0.062 seconds. The cache used 739360 bytes of memory.  
6   Data Flow Task:Information: Lookup has cached 8192 rows.  
6   Data Flow Task:Information: Lookup has cached a total of 18484 rows.  
6   Data Flow Task:Information: The Lookup processed 18484 rows in the cache. The processing time was 0.109 seconds. The cache used 739360 bytes of memory.  
7   Data Flow Task:Information: Lookup has cached 8192 rows.  
7   Data Flow Task:Information: Lookup has cached a total of 18484 rows.  
7   Data Flow Task:Information: The Lookup processed 18484 rows in the cache. The processing time was 0.14 seconds. The cache used 739360 bytes of memory.  
9   Data Flow Task:Information: Lookup has cached 980 rows.  
9   Data Flow Task:Information: Lookup has cached a total of 18484 rows.  
9   Data Flow Task:Information: The Lookup processed 18484 rows in the cache. The processing time was 0.281 seconds. The cache used 38502172 bytes of memory.  

Summary

As we can see, and to my shock, there must be some optimization/pushdown back to the source system as the cache size was identical for the first 3 runs. It's only when I used the results of the lookup that the cache was increased.

I'd like to say I'd spin up my VM and test this behaviour against 2005 to 2016 but ain't nobody got time for that. I would be surprised if this efficiency existed before 2012 though.

Upvotes: 1

Related Questions