Efficient way of populating Data Warehouse Dimensions from a staging table?

I have a SSIS ETL package that loads a CSV file into a staging table that has 100+ columns of accounting data, we are creating a datawarehouse from this data.

From the 100+ columns of data, below we can identify the columns needed to create our "product" dimension...

Example data..

product_id,product_name,product_description
'1','computer','power pc'
'2','case','atx case'
'1','computer','power pc'
'3','cpu'intel'
'3','cpu'intel'

So, what I did is created a DIM.PRODUCT table, that is...

CREATE TABLE DIM.OFFICE ( 
"product_key" int IDENTITY (1,1) NOT NULL, -- Surrogate Key
"product_id" varchar(150) NOT NULL, -- Business Key
"product_name" varchar(150) NOT NULL,
"product_Description" varchar(150) NOT NULL,
)

We know that the "product id" from the example data uniquely identifies the product, so a product id of #1 will always have a product_name = computer, and product_description = power pc.

The staging table has 3 million records, how do I populate my dimension from the staging table, so that later I can look it up when populating the fact table.

The dimensions are a type 1 dimension, product id #1 will always be "computer","power pc".

Previously I have been using SELECT DISTINCT "product id","product_name","product_description" to only get the distinct values from the staging table, and load that into the dimension, but I am finding this method to be taxing the system of resources when populating the dimensions.

Is there a better, more efficient (performance wise) way of doing this? (using TSQL or SSIS)..

We are running SQL 2008 SP3.. with plans to upgrade to SQL 2012 shortly.

Upvotes: 1

Answers (3)

gufeng

Reputation: 26

Dimension table come from fact table, maybe it is a design problem. product dim should be created in your online system
SELECT DISTINCT is the best option right now. Maybe you can try to load CSV into hash partitioned table partition by product_id

Upvotes: 0

Andras Vass

Reputation: 11638

Assuming that SSIS and SQL does not fight for the same resources, you may consider appending
OPTION (FAST 10000) to the query.
This has the tendency to result in a non-blocking query plan - hash match, instead of sort for DISTINCT - on the SQL Server. Although the SQL query itself may run longer, the processing as a whole might speed up, since SSIS can proceed with the stream in parallel. The end result might be a considerable speedup.

Upvotes: 0

Nick.Mc

Reputation: 19225

select distinct usually needs to perform a sort and this is actually where all the work is done. (You can verify this by viewing the query plan). You can optimise this in a couple of ways:

Make sure the data is already physically sorted in the required order by creating a clustered index. This 'pre sorts' the data, but of course now when you import the data you have to pre sort it on the fly, so this now slows down the import into your staging table. This might be practical if the import>staging can take some time but the dimension load must not.
Optimise your database sorting muscle by making sure TempDB is working well.

Both are really DBA type questions. Do you have a DBA?

Upvotes: 1

Efficient way of populating Data Warehouse Dimensions from a staging table?

Answers (3)

Related Questions