user1043535

In SSIS Data Flow, is there a better way to make distinct rows for all columns than "Sort"?

Below are details, but the essence of what I'm asking is whether there is a better way to make my set distinct after I union the four Flat Files.

I have an SSIS project where I read four different tab-delimited text files, union them, and insert them into a database table. I have been led to believe that "Sort" is the only way to make the rows distinct before I insert them into the database table.

In an ideal world, I would be dealing with unique rows, but the reality is that I don't have unique rows, and I cannot clean up the source data before it gets to me.

While my project works, I am convinced there must be a better way to do this -- and by better, I mean not only better performance, but also something simpler and more direct to set up in SSIS.

The examples of "Sort" I see on the web select only a few columns of a much larger table, and the option "Remove rows with duplicate sort values" -- if taken literally -- would only check for duplication in those selected columns. So, to make sure I'm not missing anything, I sort by all columns. Unfortunately, I don't need any of my columns sorted -- I handle that on the client side -- so there is a performance hit.
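To put it in T-SQL terms (a rough sketch, using a couple of my actual column names for illustration), deduplicating on just the sort columns versus the whole row is the difference between:

    -- "Remove rows with duplicate sort values" on only the sort columns
    -- behaves like a DISTINCT over just those columns:
    SELECT DISTINCT Campus, Student_ID
    FROM dbo.Base__credits;

    -- whereas I need duplicates judged across every column:
    SELECT DISTINCT *
    FROM dbo.Base__credits;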

To give you a sense of scale, here's a screenshot while I'm in debug mode. No errors, so at least I've gotten that far.

Thank you for any ideas.

SSIS Diagram of Data Flow in Debug Mode

And here's a screenshot of the "Sort" box so you can see the number of columns involved and how I am currently making my set distinct before insertion into the Database table.

Sort to Make Distinct

Upvotes: 3

Views: 20503

Answers (3)

Greg the Incredulous

Reputation: 1846

You could use the Aggregate transformation within the Data Flow. It doesn't sort; it just groups by, sums, averages, etc., which, from your comments, is what you're looking for.

NOTE: There will be a performance hit regardless of the method you choose, but this will do a distinct rather than a sort.
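Grouping by every column in the Aggregate transformation produces the same result set as a DISTINCT. In T-SQL terms (a sketch only, borrowing the asker's table and column names for illustration):

    -- What "Group by" on all columns in the Aggregate transformation amounts to:
    SELECT Campus, Student_ID, Student_Name, Course_ID, Course_Title,
        Credit, Date_Earned, Department, Final_Mark, Grade_Level,
        School, Teacher, [Transfer]
    FROM dbo.Base__credits
    GROUP BY Campus, Student_ID, Student_Name, Course_ID, Course_Title,
        Credit, Date_Earned, Department, Final_Mark, Grade_Level,
        School, Teacher, [Transfer];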

Upvotes: 2

user1043535

Here is another approach that I hadn't considered when I asked the question: on the "Control Flow" tab, after the Data Flow tasks are done, run an Execute SQL Task that reads the table into a temp table, clears the table, and then inserts only the unique rows back.

This approach doesn't waste time sorting the data. It also assures me that I'm getting truly distinct rows in exactly the way I expect. It's probably not the best way and no doubt is making a few people cringe, but it is an approach I can implement with my current skill set -- all because Microsoft didn't put a "Distinct union" option in the Data Flow toolbox.

Screenshot of Control Flow tab
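For reference, here is a minimal sketch of the statement behind that SQL Task (assuming the destination table is dbo.Base__credits, as in the query below; the temp table name is illustrative):

    -- Copy the distinct rows aside, empty the destination,
    -- and reload only the unique rows.
    SELECT DISTINCT *
    INTO #Unique_Credits
    FROM dbo.Base__credits;

    TRUNCATE TABLE dbo.Base__credits;

    INSERT INTO dbo.Base__credits
    SELECT *
    FROM #Unique_Credits;

    DROP TABLE #Unique_Credits;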

Update: At Kyle Hale's suggestion, I improved the T-SQL so that it identifies and removes the specific duplicate rows in place, which eliminates the need for a temp table.

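-- Number each group of fully-identical rows, then delete every copy after the first.
-- The ORDER BY inside ROW_NUMBER() is arbitrary here, since every column is already in the PARTITION BY.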
WITH Dup_Rows (
    Rec_Nbr,
    Campus,
    Student_ID,
    Student_Name,
    Course_ID,
    Course_Title,
    Credit,
    Date_Earned,
    Department,
    Final_Mark,
    Grade_Level,
    School,
    Teacher,
    [Transfer]
    )
AS (
    SELECT ROW_NUMBER() OVER (
            PARTITION BY c.Campus,
            c.Student_ID,
            c.Student_Name,
            c.Course_ID,
            c.Course_Title,
            c.Credit,
            c.Date_Earned,
            c.Department,
            c.Final_Mark,
            c.Grade_Level,
            c.School,
            c.Teacher,
            c.[Transfer]
            ORDER BY (SELECT NULL)
            ) AS Rec_Nbr,
        c.Campus,
        c.Student_ID,
        c.Student_Name,
        c.Course_ID,
        c.Course_Title,
        c.Credit,
        c.Date_Earned,
        c.Department,
        c.Final_Mark,
        c.Grade_Level,
        c.School,
        c.Teacher,
        c.[Transfer]
    FROM dbo.Base__credits c
    )
DELETE
FROM Dup_Rows
WHERE Dup_Rows.Rec_Nbr > 1;

Upvotes: 2

Kyle Hale

Reputation: 8120

If your requirement is "identify and remove duplicates before they reach the destination", you can see how this can only be accomplished with some sort of temporary holding structure, since SSIS has to read every single row of the files before it can know whether any given row is a duplicate.

My personal suggestion would be to write a Script Task to remove duplicates at a row level from your files.

  1. Create a Script Task called "Clean TSV Files".
  2. In your script, create a collection called Checksums.
  3. Open each TSV file and read through it in a loop, one line at a time.
  4. Call String.GetHashCode() on each line.
  5. If the value is already in Checksums, toss out the row; it's a dupe.
  6. If not, add the value to Checksums and write the row out to a new file.
  7. Save the new file somewhere SSIS can see it.
Now you'll have 4 files that have been cleaned and are ready to just be unioned and inserted into your database.

If for whatever reason you don't want to do file modification and want to keep everything in memory, you could also write the results to a Recordset Destination and then use that as an ADO object in a Script Source, but that is potentially memory intensive.

Upvotes: 0
