Reputation: 1147
I have a (somewhat large) flat file (CSV) that I am trying to import into a SQL Server table using an SSIS package. There is nothing special about it; it's a plain import. The problem is that more than 50% of the lines are duplicates.
E.g. Data:
Item Number | Item Name | Update Date
ITEM-01 | First Item | 1-Jan-2013
ITEM-01 | First Item | 5-Jan-2013
ITEM-24 | Another Item | 12-Mar-2012
ITEM-24 | Another Item | 13-Mar-2012
ITEM-24 | Another Item | 14-Mar-2012
Now I need to create my master item record table from this data. As you can see, the rows are duplicates except for the Update Date. The file is guaranteed to always be sorted by Item Number, so all I need to do is check whether the next item number equals the previous item number and, if it does, NOT import that line.
I used a Sort transformation with the remove-duplicates option in the SSIS package, but it actually tries to sort all the lines, which is useless because they are already sorted. It also takes forever to sort that many lines.
So is there any other way?
Upvotes: 3
Views: 8589
Reputation: 1882
There are a couple of approaches you can take to do this.

One option is an Aggregate Transformation: group by Item Number and Item Name, then perform an aggregate operation on Update Date. Based on the logic you mentioned above, the Minimum operation should work. To use Minimum, you'll first need to convert the Update Date column to a date (you can't take the Minimum of a string); that conversion can be done in a Data Conversion Transformation.
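To see what this grouping produces, here is a minimal standalone C# sketch of the same group-and-take-minimum logic using LINQ. This is only an illustration, not the Aggregate Transformation itself; the Row record and the MasterRecords helper are names made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class AggregateSketch
{
    // Hypothetical row shape mirroring the question's sample data.
    public record Row(string ItemNumber, string ItemName, DateTime UpdateDate);

    // Group by Item Number + Item Name and keep the Minimum Update Date --
    // the same result the Aggregate Transformation would produce.
    public static List<Row> MasterRecords(IEnumerable<Row> rows) =>
        rows.GroupBy(r => (r.ItemNumber, r.ItemName))
            .Select(g => new Row(g.Key.ItemNumber, g.Key.ItemName,
                                 g.Min(r => r.UpdateDate)))
            .ToList();

    static void Main()
    {
        var rows = new[]
        {
            new Row("ITEM-01", "First Item",   new DateTime(2013, 1, 1)),
            new Row("ITEM-01", "First Item",   new DateTime(2013, 1, 5)),
            new Row("ITEM-24", "Another Item", new DateTime(2012, 3, 12)),
            new Row("ITEM-24", "Another Item", new DateTime(2012, 3, 13)),
            new Row("ITEM-24", "Another Item", new DateTime(2012, 3, 14)),
        };

        foreach (var m in MasterRecords(rows))
            Console.WriteLine($"{m.ItemNumber} | {m.ItemName} | {m.UpdateDate:d-MMM-yyyy}");
    }
}
```

On the sample data this collapses the five rows to one per item, each carrying the earliest Update Date.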
Another option is a Script Component, with which you can implement exactly the logic you mentioned above:

if next item number = previous item number then do NOT import this line

First, you must configure the Script Component appropriately (the steps below assume that you don't rename the default input and output names):
1. Add the Script Component after the Flat File Source in your Data Flow.
2. Under Input Columns, select all columns.
3. Under Inputs and Outputs, select Output 0 and set the SynchronousInputID property to None.
4. Manually add columns to Output 0 to match the columns in Input 0 (don't forget to set the data types).
Then edit the script: modify the Input0_ProcessInputRow method as below, and add a private field named previousItemNumber:

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // The file is sorted by Item Number, so a changed item number means
        // a new item: emit the row. Repeats of the previous number are skipped.
        if (!Row.ItemNumber.Equals(previousItemNumber))
        {
            Output0Buffer.AddRow();
            Output0Buffer.ItemName = Row.ItemName;
            Output0Buffer.ItemNumber = Row.ItemNumber;
            Output0Buffer.UpdateDate = Row.UpdateDate;
        }

        previousItemNumber = Row.ItemNumber;
    }

    private string previousItemNumber = string.Empty;
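The same keep-the-first-row-of-each-run comparison can be tried outside SSIS. Below is a hedged, self-contained C# sketch; the FirstOfEachRun helper name and the pipe-delimited parsing are assumptions for illustration, not part of the Script Component API.

```csharp
using System;
using System.Collections.Generic;

static class ConsecutiveDedup
{
    // Keeps only the first row of each run of equal keys in an
    // already-sorted sequence -- the same comparison the Script
    // Component above performs against previousItemNumber.
    public static IEnumerable<string> FirstOfEachRun(IEnumerable<string> lines,
                                                     Func<string, string> key)
    {
        string previousItemNumber = string.Empty;
        foreach (var line in lines)
        {
            if (!key(line).Equals(previousItemNumber))
                yield return line;
            previousItemNumber = key(line);
        }
    }

    static void Main()
    {
        var lines = new[]
        {
            "ITEM-01 | First Item | 1-Jan-2013",
            "ITEM-01 | First Item | 5-Jan-2013",
            "ITEM-24 | Another Item | 12-Mar-2012",
            "ITEM-24 | Another Item | 13-Mar-2012",
            "ITEM-24 | Another Item | 14-Mar-2012",
        };

        // Key on the Item Number column (everything before the first pipe).
        foreach (var kept in FirstOfEachRun(lines, l => l.Split('|')[0].Trim()))
            Console.WriteLine(kept);
    }
}
```

Run against the question's sample data, this emits only the first ITEM-01 line and the first ITEM-24 line. Note that, like the Script Component, it relies on the file being sorted by Item Number.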
Upvotes: 5
Reputation: 486
If performance is a big concern for you, I'd suggest dumping the entire text file into a temporary table on SQL Server and then using a SELECT DISTINCT * to get the desired values.
Upvotes: 0