Reputation: 20935
I am performing ETL on a set of Office, Employee, and Location tables by following the standard practice of bringing all the data into staging tables, first by way of Extract packages, and then performing Transform-Load on each of the staging tables to get the data into the respective destination table(s).
In each of my Transform-Load SSIS packages, I am performing CUD (Create, Update, Delete) by using MERGE JOIN and Conditional Splits.
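For readers unfamiliar with that pattern, here is a minimal sketch in plain Python (not SSIS) of what a MERGE JOIN plus Conditional Split effectively does: full-outer-join the staging rows to the destination rows on a business key, then route each row to a Create, Update, or Delete output. The table contents and the `Code` key are illustrative, not from the actual schema.

```python
def classify_cud(staging, destination, key):
    """Return (creates, updates, deletes) given two lists of row dicts,
    simulating a full outer join on `key` followed by a conditional split."""
    stage_by_key = {row[key]: row for row in staging}
    dest_by_key = {row[key]: row for row in destination}

    # In staging but not destination -> Create.
    creates = [row for k, row in stage_by_key.items() if k not in dest_by_key]
    # In both, but with differing values -> Update.
    updates = [row for k, row in stage_by_key.items()
               if k in dest_by_key and row != dest_by_key[k]]
    # In destination but no longer in staging -> Delete.
    deletes = [row for k, row in dest_by_key.items() if k not in stage_by_key]
    return creates, updates, deletes

# Hypothetical sample data, just to show the three routes.
staging = [{"Code": "NYC", "City": "New York"}, {"Code": "SEA", "City": "Seattle"}]
destination = [{"Code": "NYC", "City": "NYC"}, {"Code": "LAX", "City": "Los Angeles"}]
creates, updates, deletes = classify_cud(staging, destination, "Code")
```

In SSIS the same three routes come out of the Conditional Split as separate data paths feeding an OLE DB Destination, an OLE DB Command (or staged update), and a delete step respectively.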
This works fine when the data in the staging table is 1-to-1 with the corresponding real table. In the scenario below (see image), the SampleLocation table is 1-to-1 with the StageSampleLocation table.
The trouble I am having is deciding how to handle a situation where the Staging table has data that will go into the Foreign key table(s).
If you take a look at the following database diagram...
The data from StageSampleOffice goes into SampleOffice for fields that have Office data. In addition to office data, StageSampleOffice has Person data: in this example, the OfficeManagerName field will need to be looked up in the FK table SamplePerson. If the name doesn't exist in the SamplePerson table, it will need to be inserted into SamplePerson first, and the PersonId PK value for that person will be retrieved and stored as the FK value in the row for the imported Office in the SampleOffice table, in my Data Flow Task.
Similarly, for the address info in StageSampleOffice, the details will need to be looked up in the SampleLocation FK table, and if the address doesn't exist, a new one needs to be inserted with the corresponding values from StageSampleOffice. Once that is done, the LocationId for the address will be stored as the FK in the SampleOffice table.
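The lookup-or-insert step described above is a get-or-create pattern. Here is a minimal sketch in plain Python (outside SSIS) of resolving OfficeManagerName to a PersonId, inserting a new SamplePerson row first when the name is unknown; the in-memory dict stands in for the SamplePerson table, and the field names are illustrative.

```python
class PersonLookup:
    """Simulates looking up SamplePerson by name, inserting when missing."""

    def __init__(self):
        self._by_name = {}   # Name -> PersonId (stand-in for the table)
        self._next_id = 1    # stand-in for the identity column

    def get_or_create(self, name):
        """Return the PersonId for `name`, inserting a new row if needed."""
        if name not in self._by_name:
            self._by_name[name] = self._next_id   # INSERT + capture new PK
            self._next_id += 1
        return self._by_name[name]

persons = PersonLookup()
office_rows = [
    {"OfficeName": "HQ", "OfficeManagerName": "Ada"},
    {"OfficeName": "Branch", "OfficeManagerName": "Ada"},
    {"OfficeName": "Lab", "OfficeManagerName": "Grace"},
]
# Replace the manager name with the resolved FK before loading SampleOffice.
for row in office_rows:
    row["OfficeManagerPersonId"] = persons.get_or_create(row.pop("OfficeManagerName"))
```

In SSIS the equivalent building block is a Lookup transformation whose no-match output feeds an insert (with the new surrogate key retrieved afterwards), which is exactly why pre-loading the FK tables, as proposed below, simplifies the main data flow.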
As you can see, data for SampleLocation and SamplePerson could come into the system from 2 or more sources. In the example above, for SampleLocation I get a Location data file that only has the addresses, and I also get addresses as part of Office records from various office types, which come in through the StageSampleOffice table.
I have already separated the Extract workflows from the Transform-Load workflows. I have 1 extract package per staging table, which essentially reads data from the source (flat file or table), truncates the staging table, and imports everything as-is into the staging table.
I am thinking that for the Person data in StageSampleOffice, I will first insert the data into the StageSamplePerson table (not shown in diagram) and then execute the Transform-Load package for SamplePerson that will do the Create or Update for those Persons. Similarly, for the Location data in StageSampleOffice, I will first insert the data into the StageSampleLocation table (not shown in diagram) and then execute the Transform-Load package for SampleLocation that will do the Create or Update for those Locations. This way, all the FK rows will be present in the respective tables when the flow returns to my main package that does the Transform-Load for the SampleOffice table.
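The plan above amounts to ordering the packages so that each Transform-Load runs only after the packages for its FK tables have finished. A minimal sketch of that ordering logic, with hypothetical package names (in SSIS this would be Execute Package Tasks wired in sequence, not Python):

```python
def execution_order(depends_on):
    """Return package names ordered so every package follows its FK
    dependencies (a simple depth-first topological sort; assumes no cycles)."""
    order = []

    def visit(pkg):
        if pkg in order:
            return
        for dep in depends_on.get(pkg, []):
            visit(dep)
        order.append(pkg)

    for pkg in depends_on:
        visit(pkg)
    return order

# Hypothetical package names: the Office load depends on both FK loads.
depends_on = {
    "Load_SampleOffice": ["Load_SamplePerson", "Load_SampleLocation"],
    "Load_SamplePerson": [],
    "Load_SampleLocation": [],
}
order = execution_order(depends_on)
```

With this ordering, the SampleOffice data flow can use plain Lookups against SamplePerson and SampleLocation, since every FK row is guaranteed to exist by the time it runs.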
Is this a good idea, or is there a better way?
Thank you!
Upvotes: 1
Views: 544
Reputation: 31775
Seems like "6 of one, half-dozen of the other" to me.
Either way you're eventually checking every person and every location to see if it is already in the final destination table and doing an insert or not.
Whether you "pre-condense" them in the staging tables or not, the workload will be the same. I would go with the approach that seems more intuitive to you, because that will be the one that you will find more maintainable in the future.
Upvotes: 1