BakersCat
BakersCat

Reputation: 33

Flattening relational tables into a single flat table

I am trying to figure out the most optimal way to flatten a series of tables into one large flat table for the purpose of creating a single extract (business reasons dictate it has to be a single final table)

We have 15 tables that have around 25 million records each with almost 200 columns in total that need merging into one table. Thankfully this is a one off task.

I believe we have a star schema in data warehousing terms, where there is a master system table that holds each record once with a business Record_ID, and then the rest of the tables are a mix, some only hold each Business Record_ID once, while others will hold a Business Record_ID multiple times. Of the tables with multiple Business Record_IDs, to flatten them we've taken the decision to take the first 4/10/12 repetitions (depending on which table it is) and pivot them into columns. This unfortunately means the final table has 500 columns that need populating.

Now my first attempt was to create a view around an inner join, and combine all of these tables with the master system table on the Business Record_ID. However this proved to be very resource intensive and after 5 hours the DBA killed it. Turns out trying to inner join 15 tables with 25 million records each (as well as pivoting some of them) is very very bad.

I next turned to SSIS, and looked at perhaps using the MERGE JOIN transformation but it only really works with 2 tables at a time. I played around with this idea a bit and I'd have to create a 2nd table to achieve this one table at a time, something like:

  1. TBL1 + TBL2 ---MERGE JOIN---> FLAT_TABLE_1
  2. FLAT_TABLE_1 + TBL3 ---MERGE JOIN---> FLAT_TABLE_2
  3. TRUNCATE FLAT_TABLE_1
  4. FLAT_TABLE_2 + TBL4 ---MERGE JOIN---> FLAT_TABLE_1
  5. TRUNCATE FLAT_TABLE_2
  6. FLAT_TABLE_1 + TBL5 ---MERGE JOIN---> FLAT_TABLE_2

My final thought was to simply use TSQL scripts and write directly to the flat table using an initial INSERT of the system table as the base and then using the UPDATE function to write all the columns from the other 14 tables one by one.

I know TSQL has the MERGE statement but having looked online, performance wise it might be best to stay with INSERT/UPDATEs.

Any way, I hope I've explained my problem adequately and if there is anything else I need to add/explain please let me know.

Edit: We're using SQL Server/SSIS 2016

Upvotes: 1

Views: 1228

Answers (1)

Tab Alleman
Tab Alleman

Reputation: 31795

My final thought was to simply use TSQL scripts and write directly to the flat table using an initial INSERT of the system table as the base and then using the UPDATE function to write all the columns from the other 14 tables one by one.

This is the approach I would use, considering the size of the tables involved. Be sure to do your UPDATEs in batches, instead of all rows at once.

Upvotes: 1

Related Questions