The One
The One

Reputation: 4686

SSIS - Find duplicate rows and keep the one with the highest value

I have a flat file, it looks like this

┌───────┬──────────┬──────────┐
│ ITEM  │ QTY_ON_HAND LOCATION│
├───────┼──────────┼──────────┤
│ BOX   │        1 │ A        │
│ BOX   │        0 │ B        │
│ THING │        1 │ C        │
│ OTHER │        2 │ A        │
└───────┴──────────┴──────────┘  

ITEM can have duplicates, I need to keep the one where QTY_ONHAND is highest (first BOX in example above), if both have the same value, then I would keep any occurrence.

Approaches:

Sources:

SSIS: Flat File Source to SQL without Duplicate Rows

https://social.msdn.microsoft.com/Forums/sqlserver/en-US/581df31a-5cd0-4802-a472-7ee85c6e8464/how-does-remove-duplicates-option-in-sort-transformation-behaves?forum=sqlintegrationservices

Remove duplicate in SSIS package with preference over a column data

Upvotes: 1

Views: 3806

Answers (2)

simon coleman
simon coleman

Reputation: 373

In SSIS to do this directly from a flat file I think is tricky, as an SSIS aggregate step will not produce the correct result on its own (either producing multiple rows or failing to identify the correct location)

However, it can can be done in a SQL db without much effort, using a partial left join and then filtering for the topmost row. SSIS can be used to import the data to a staging table first and run the query from there.

The SQL is:

select 
   i1.item,
   i1.quantity,
   i1.location 
from 
   item as i1 
   left outer join item as i2 on i1.item = i2.item and i1.quantity<i2.quantity
where 
   i2.quantity is null

This stage/query approach may be better if you subsequently reuse the data for further processing.

If there is a large amount of data it would also be worth considering whether indexing the data after load improves performance, particularly if the flat file itself is already sorted and that aspect of index build can be skipped.

There are a couple of other ways of deriving this in the db, but this one is fairly efficient and integrates well into more complex queries.

Upvotes: 1

The One
The One

Reputation: 4686

The solution is straightforward, you just add an Aggregate Transformation

enter image description here

Group by all the other columns you know are similar for all the rows in the group, and apply the Maximum operation for the column you want to use to get the row with the highest value.

enter image description here

I'm not grouping by Location, nor any other column with values I know will be different for each row, I don't need Location further in the flow, but if I did, then I suppose I could use more Aggregate Transformations I don't know.

Upvotes: 1

Related Questions