sallushan
sallushan

Reputation: 1147

SSIS: Flat File Source to SQL without Duplicate Rows

I have a (bit large) flat file (csv). Which I am trying to import in my SQL Server table using SSIS Package. There is nothing special, its a plain import. The problem is, more than 50% of the lines are duplicate.

E.g. Data:

Item Number    |    Item Name     |     Update Date
ITEM-01        | First Item       | 1-Jan-2013
ITEM-01        | First Item       | 5-Jan-2013
ITEM-24        | Another Item     | 12-Mar-2012
ITEM-24        | Another Item     | 13-Mar-2012
ITEM-24        | Another Item     | 14-Mar-2012

Now I need to create my Master Item record table using this data, as you can see the data is duplicate due to the Update Date. This is guaranteed that file will always be sorted by Item Number. So what I need to do is just to check if next item number = previous item number then do NOT import this line.

I used Sort with Remove Duplicate, in SSIS package, but it is actually trying to sort all the lines which is useless because lines are already sorted. Plus it is taking forever to sort too many lines.

So is there any other way?

Upvotes: 3

Views: 8589

Answers (2)

gannaway
gannaway

Reputation: 1882

There are a couple of approaches you can take to do this.

1. Aggregate Transformation

Group by Item Number and Item Name and then perform an aggregate operation on Update Date. Based on the logic you mentioned above, the Minimum operation should work. In order to use the Minimum operation, you'll need to convert the Update Date column to a date (can't perform Minimum on a string). That conversion can be done in a Data Conversion Transformation. Below are the guts of what this would look like:

enter image description here

2. Script Component Transformation

Essentially, you could implement the logic you mentioned above:

if next item number = previous item number then do NOT import this line

First, you must configure the Script Component appropriately (the steps below assume that you don't rename the default input and output names):

  1. Select Transformation as the Script Component type
  2. Add the Script Component after the Flat File Source in your Data Flow:

    enter image description here

  3. Double Click the Script Component to open the Script Transformation Editor.
  4. Under Input Columns, select all columns:

    enter image description here

  5. Under Inputs and Outputs, select Output 0, and set the SynchronousInputID property to None

    enter image description here

  6. Now manually add columns to Output 0 to match the columns in Input 0 (don't forget to set the data types):

    enter image description here

  7. Finally, edit the script. There will be a method named Input0_ProcessInputRow- modify it as below and add a private field named previousItemNumber as below:
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        if (!Row.ItemNumber.Equals(previousItemNumber))
        {
            Output0Buffer.AddRow();
            Output0Buffer.ItemName = Row.ItemName;
            Output0Buffer.ItemNumber = Row.ItemNumber;
            Output0Buffer.UpdateDate = Row.UpdateDate;
        }  

        previousItemNumber = Row.ItemNumber;
    }

    private string previousItemNumber = string.Empty;

Upvotes: 5

Eli Ekstein
Eli Ekstein

Reputation: 486

If performance is a biggy for you I'd suggest you to dump the entire text file into a temporary table on SQL Server and then use a SELECT DISTINCT * to get the desired values.

Upvotes: 0

Related Questions