Mario Almaguer

Reputation: 61

How to take data from 2 databases and move it to a data warehouse using SSIS?

I am currently studying databases, and I have a question. The professor told us to create 2 different databases, and then move all the data to a star schema data model.

Here is the diagram for the first database; I have already filled the tables with data.

This is the diagram for the second database, also with data.

And finally, this is my star schema data model.

The problem I am facing is that I do not know how to start the mapping when adding my OLE DB source and OLE DB destination.

I have already searched through the web, but I only find examples where they have to move just one database to the star schema data model.

Upvotes: 1

Views: 793

Answers (2)

billinkc

Reputation: 61201

The task you have is to consolidate two transactional/OLTP systems into a data warehouse. The reason you find only examples of moving/mapping one system into a data warehouse is that you simply repeat the process for each additional source that feeds into the DW.

In your example, you are integrating two sales systems to produce a unified sales business process report.

My approach would be to copy all the tables as-is into your DW, putting them into staging schemas (stage_d1, stage_d2). This way you have all the data locally, and it's likely consistent as of the start of your data extract, i.e. as of 9:30 this morning.
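A minimal sketch of that staging copy, assuming both source databases live on the same SQL Server instance (SourceDb1 and SourceDb2 are placeholder names; the staging table names match the queries below):

CREATE SCHEMA stage_d1;
GO
CREATE SCHEMA stage_d2;
GO

-- SELECT ... INTO creates each staging table and copies the rows in one step
SELECT * INTO stage_d1.Customer   FROM SourceDb1.dbo.Customer;
SELECT * INTO stage_d1.Book_Order FROM SourceDb1.dbo.Book_Order;
SELECT * INTO stage_d1.Ordering   FROM SourceDb1.dbo.Ordering;
SELECT * INTO stage_d2.Customers  FROM SourceDb2.dbo.Customers;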

Now that you have the data locally, you need to transform and enrich the data to populate your dimensions and then your fact table.

Let's analyze Dim_Customer. This table is probably a little light in terms of providing value, but the methodology is what you should focus on. System 1 supplies a first name, last name, city, state, and zip code. System 2 gives us a company name, contact name, city, state, postal code, and phone.

Given the use of both postal code and zip code, I'd wonder whether we're dealing with international addresses versus US-centric (zip code) data. I'd also notice that we don't have an actual address line for this data. The point is: analyze your data so you know you're modeling something that solves the problem (reporting on sales across all systems).
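As an example of the kind of check I mean, a quick profiling query against the staged data (just a sketch; the State column comes from the diagrams as described above):

SELECT C.State, COUNT(*) AS CustomerCount
FROM stage_d1.Customer AS C
GROUP BY C.State
ORDER BY CustomerCount DESC;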

The next question I'd wonder about is how we populate the Customer dimension. If a Mario Almaguer had a purchase in both system 1 and system 2, are they the "same" person? Does it matter for this business process? If we sold to a person in TX and that name also exists in ME, does it matter if the name is in there twice?

I'll assume we only care about unique customer names. If it's a bad assumption, we go back and model it differently.

In my source, I'll write a query.

SELECT DISTINCT CONCAT(C.FirstName, ' ', C.LastName) AS CustomerName FROM stage_d1.Customer AS C;

Run that and see that it returns the data I want. I'll then use an OLE DB Source in an SSIS data flow, with the third drop-down option, Source query.

If I run the package, we'll get all the unique customer names from the source system. But we only want the names we don't already have, which means we need something to check our reference table for existing matches. That's the Lookup component.

The source for the Lookup will be the DW's Dim_Customer table. You'll match on CustomerName. The Lookup component tells us whether an incoming row matched, and we can get two output streams: match and no-match. We're only interested in the no-match path because that's the new data. Andy Leonard has an excellent Stairway to Integration Services series; in particular, we're talking about an Incremental Load.
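Conceptually, the Lookup's no-match output is a left anti-join against the dimension. A sketch of that logic in SQL terms (only an illustration, not what SSIS executes internally):

SELECT src.CustomerName
FROM (
    SELECT DISTINCT CONCAT(C.FirstName, ' ', C.LastName) AS CustomerName
    FROM stage_d1.Customer AS C
) AS src
LEFT JOIN dbo.Dim_Customer AS DC
    ON DC.CustomerName = src.CustomerName
WHERE DC.CustomerName IS NULL; -- no match found = new customer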

From the Lookup, we'll drag the no-match branch to an OLE DB Destination pointed at the Dim_Customer table.

You run that and Dim_Customer is populated. Run it again, and no new rows should be added as we're looking for new data only.

Now we need to integrate the second system's staged customer data. Fortunately, the steps are the same, except this time our query is easier.

SELECT DISTINCT C.ContactName AS CustomerName FROM stage_d2.Customers AS C;

Lather, rinse, repeat for all of your other dimensions.

You could also skip the data flows and simply execute a query to do the same.

INSERT INTO dbo.Dim_Customer(CustomerName) 
SELECT DISTINCT CONCAT(C.FirstName, ' ', C.LastName) AS CustomerName 
FROM stage_d1.Customer AS C 
WHERE NOT EXISTS (SELECT * FROM dbo.Dim_Customer AS DC WHERE DC.CustomerName = CONCAT(C.FirstName, ' ', C.LastName));
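
And the matching query for the second system's staged data:

INSERT INTO dbo.Dim_Customer(CustomerName)
SELECT DISTINCT C.ContactName AS CustomerName
FROM stage_d2.Customers AS C
WHERE NOT EXISTS (SELECT * FROM dbo.Dim_Customer AS DC WHERE DC.CustomerName = C.ContactName);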

Lather, rinse, repeat for the remaining dimensions.

Loading the fact table is similar, except we will use Lookup components to find matches (as we need to translate our data to the dimensions' ids). Here I'll show how we'd populate a simplified version of your fact table.

SELECT O.Price AS UnitPrice, BO.OrderDate AS [Date], 1 AS Quantity, 0 AS Discount, CONCAT(C.FirstName, ' ', C.LastName) AS CustomerName
FROM stage_d1.Ordering AS O 
INNER JOIN stage_d1.Book_Order AS BO 
    ON BO.OrderID = O.OrderID 
INNER JOIN stage_d1.Customer AS C 
    ON C.CustomerID = BO.Cus_CustomerID;

That's my source query. The customer lookup will continue to match on Dim_Customer's CustomerName, but this time we'll retrieve the CustomerID from the Lookup component.

The destination then uses UnitPrice, Date (depending on how you handle it), Quantity, and Discount directly from our source. The remaining dimension keys we populate through our lookups.
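If you'd rather stay in T-SQL for the fact load too, the Lookup translates to a join that resolves the surrogate key. A sketch, assuming a fact table named dbo.Fact_Sales with these columns (adjust to your actual star schema):

INSERT INTO dbo.Fact_Sales (UnitPrice, [Date], Quantity, Discount, CustomerID)
SELECT O.Price, BO.OrderDate, 1, 0, DC.CustomerID
FROM stage_d1.Ordering AS O
INNER JOIN stage_d1.Book_Order AS BO
    ON BO.OrderID = O.OrderID
INNER JOIN stage_d1.Customer AS C
    ON C.CustomerID = BO.Cus_CustomerID
-- this join does the Lookup component's job of resolving the dimension key
INNER JOIN dbo.Dim_Customer AS DC
    ON DC.CustomerName = CONCAT(C.FirstName, ' ', C.LastName);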

Upvotes: 1

NickW

Reputation: 9768

The standard approach would be to do the following:

  1. Copy your source data into staging tables in your target database
  2. Write the queries necessary to populate the star schema against the staging tables

You populate all your dimension tables first and then all your fact tables.

Upvotes: 0
