I have a flat file, it looks like this ┌───────┬──────────┬──────────┐ │ ITEM │ QTY_ON_HAND LOCATION│ ├───────┼──────────┼──────────┤ │ BOX │ 1 │ A │ │ BOX │ 0 │ B │ │ THING │ 1 │ C │ │ OTHER │ 2 │ A │ └───────┴──────────┴──────────┘ ITEM can have duplicates, I need to keep the one where QTY_ONHAND is highest (first BOX in example above), if both have the same value, then I would keep any occurrence. Approaches: Using Sort + Remove rows with duplicate sort values . Sort will delete random occurrences, I can't guarantee that the one with highest value will remain Script Component . I can detect duplicates in a Script Component similar to this answer but I'd have to store all the rows in memory, compare each row against all the stored set of rows, keep the highest, remove the duplicates and somehow return only the valid rows, sounds very inefficient, however, there are not too many rows so I'll consider it (I'm still not sure how to achieve this). Aggregate Transformation . I understand I can group the rows by ITEM and use the MAX operation to keep the one with highest value, this is the one I'm trying to do right now, but I'm stuck. Sources: SSIS: Flat File Source to SQL without Duplicate Rows https://social.msdn.microsoft.com/Forums/sqlserver/en-US/581df31a-5cd0-4802-a472-7ee85c6e8464/how-does-remove-duplicates-option-in-sort-transformation-behaves?forum=sqlintegrationservices Remove duplicate in SSIS package with preference over a column data

Reputation: 4686

SSIS - Find duplicate rows and keep the one with the highest value

I have a flat file, it looks like this

┌───────┬──────────┬──────────┐
│ ITEM  │ QTY_ON_HAND LOCATION│
├───────┼──────────┼──────────┤
│ BOX   │        1 │ A        │
│ BOX   │        0 │ B        │
│ THING │        1 │ C        │
│ OTHER │        2 │ A        │
└───────┴──────────┴──────────┘

ITEM can have duplicates, I need to keep the one where QTY_ONHAND is highest (first BOX in example above), if both have the same value, then I would keep any occurrence.

Approaches:

Using Sort + Remove rows with duplicate sort values. Sort will delete random occurrences, I can't guarantee that the one with highest value will remain
Script Component. I can detect duplicates in a Script Component similar to this answer but I'd have to store all the rows in memory, compare each row against all the stored set of rows, keep the highest, remove the duplicates and somehow return only the valid rows, sounds very inefficient, however, there are not too many rows so I'll consider it (I'm still not sure how to achieve this).
Aggregate Transformation. I understand I can group the rows by ITEM and use the MAX operation to keep the one with highest value, this is the one I'm trying to do right now, but I'm stuck.

Sources:

SSIS: Flat File Source to SQL without Duplicate Rows

https://social.msdn.microsoft.com/Forums/sqlserver/en-US/581df31a-5cd0-4802-a472-7ee85c6e8464/how-does-remove-duplicates-option-in-sort-transformation-behaves?forum=sqlintegrationservices

Remove duplicate in SSIS package with preference over a column data

Upvotes: 1

Answers (2)

simon coleman

Reputation: 373

In SSIS to do this directly from a flat file I think is tricky, as an SSIS aggregate step will not produce the correct result on its own (either producing multiple rows or failing to identify the correct location)

However, it can can be done in a SQL db without much effort, using a partial left join and then filtering for the topmost row. SSIS can be used to import the data to a staging table first and run the query from there.

The SQL is:

select 
   i1.item,
   i1.quantity,
   i1.location 
from 
   item as i1 
   left outer join item as i2 on i1.item = i2.item and i1.quantity<i2.quantity
where 
   i2.quantity is null

This stage/query approach may be better if you subsequently reuse the data for further processing.

If there is a large amount of data it would also be worth considering whether indexing the data after load improves performance, particularly if the flat file itself is already sorted and that aspect of index build can be skipped.

There are a couple of other ways of deriving this in the db, but this one is fairly efficient and integrates well into more complex queries.

Upvotes: 1

The One

Reputation: 4686

The solution is straightforward, you just add an Aggregate Transformation

Group by all the other columns you know are similar for all the rows in the group, and apply the Maximum operation for the column you want to use to get the row with the highest value.

I'm not grouping by Location, nor any other column with values I know will be different for each row, I don't need Location further in the flow, but if I did, then ~~I suppose I could use more Aggregate Transformations~~ I don't know.

Upvotes: 1

SSIS - Find duplicate rows and keep the one with the highest value

Answers (2)

Related Questions