William Ji Carpenter
William Ji Carpenter

Reputation: 189

SSIS - Split multiple columns into single rows

Below is a very small example of the flat table I need to split. The first table is a Lesson table which has the ID, Name and Duration. The second table is a student table which only has the Student Name as a PK. And the third table will be a Many to Many of Lesson ID and Student Name.

Lesson Id Lesson Name Lesson Duration Student1 Student2 Student3 Student4
1 Maths 1 Hour Jean Paul Jane Doe
2 English 1 Hour Jean Jane Doe
  1. I don't know how, using SSIS, I can assign Jean, Paul, Jane and Doe to their own tables using the Student 1, 2, 3 and 4 columns. When I figure this out, I imagine I can use the same logic to map the Lesson ID and columns to the third Many to Many table?
  2. How do I handle duplicate entries, for example Jean Jane and Doe already exist from the first row so they do not need to be added to the Students table.
  3. I assume I use a conditional split to skip null values? For example Student4 on the second row is Null.

Thanks for the assistance.

Upvotes: 0

Views: 297

Answers (1)

billinkc
billinkc

Reputation: 61211

Were it me, I would design this as 3 data flows.

Data flow 1 - student population

Since we're assuming the name is what makes a student unique, we need to build a big list of the unique names.

SELECT D.*
FROM
(
    SELECT S.Student1 AS StudentName
    FROM dbo.MyTable AS S
    UNION
    SELECT S.Student2 AS StudentName
    FROM dbo.MyTable AS S
    UNION
    SELECT S.Student3 AS StudentName
    FROM dbo.MyTable AS S
    UNION
    SELECT S.Student4 AS StudentName
    FROM dbo.MyTable AS S
)D
WHERE D.StudentName IS NOT NULL
ORDER BY D.StudentName;

The use of UNION in the query will handle deduplication of data and we wrap that in a derived table to filter the NULLs.

I add an explicit order by not that it's needed but since I'm assuming you're using the name as the primary key, let's avoid sort operation when we land the data.

Add an OLE DB Source to your data flow and instead of picking a table in the drop down, you'll use the above query.

Add an OLE DB Destination to the same data flow and connect the two. Assuming your target table looks something like

CREATE TABLE dbo.Student
(
    StudentName varchar(50) NOT NULL CONSTRAINT PK__dbo__Student PRIMARY KEY(StudentName)
);

Data Flow 2 - Lessons

Dealers choice here, you can either write the query or just point at the source table.

A very good practice to get into with SSIS is to only bring the data you need into the buffers so I would write a query like

SELECT DISTINCT S.[Lesson Id], S.[Lesson Name], S.[Lesson Duration]
FROM dbo.MyTable AS S;

I favor a distinct here as I don't know enough about your data but if it were extended and a second Maths class was offered to accommodate another 4 students, it might be Lesson Id 1 again. Or it might be 3 as it indicates course time or something else.

Add an OLE DB Destination and land the data.

Data Flow 3 - Many to Many

There's a few different ways to handle this. I'd favor the lazy way and repeat our approach from the first data flow

SELECT D.*
FROM
(
    SELECT S.Student1 AS StudentName, S.[Lesson Id]
    FROM dbo.MyTable AS S
    UNION
    SELECT S.Student2 AS StudentName, S.[Lesson Id]
    FROM dbo.MyTable AS S
    UNION
    SELECT S.Student3 AS StudentName, S.[Lesson Id]
    FROM dbo.MyTable AS S
    UNION
    SELECT S.Student4 AS StudentName, S.[Lesson Id]
    FROM dbo.MyTable AS S
)D
WHERE D.StudentName IS NOT NULL
ORDER BY D.StudentName;

And then land in your bridge table with an OLE DB Destination and be done with it.

If this has been homework/an assignment to have you learn the native components...

Do keep with the 3 data flow approach. Trying to do too much in one go is a recipe for trouble.

The operation of moving wide data to narrow data is an Unpivot operation. You'd use that in the Student and bridge table data flows but honestly, I think I've used that component less than 10 times in my career and/or answering SSIS questions here and I do a lot of that.

If the Unpivot operation generates a NULL, then yes, you'd likely want to use a Conditional Split to filter those rows out.

If your reference tables were more complex, then you'd likely be adding a Lookup component to your bridge table population step to retrieve the surrogate key.

Upvotes: 1

Related Questions