Reputation: 4686
I have a flat file, it looks like this
┌───────┬─────────────┬──────────┐
│ ITEM  │ QTY_ON_HAND │ LOCATION │
├───────┼─────────────┼──────────┤
│ BOX   │ 1           │ A        │
│ BOX   │ 0           │ B        │
│ THING │ 1           │ C        │
│ OTHER │ 2           │ A        │
└───────┴─────────────┴──────────┘
ITEM can have duplicates. I need to keep the row where QTY_ON_HAND is highest (the first BOX row in the example above); if both rows have the same value, then I can keep either occurrence.
Approaches:
Sort + Remove rows with duplicate sort values. Sort discards arbitrary occurrences, so I can't guarantee that the row with the highest value will remain.
Script Component. I could detect duplicates in a Script Component, similar to this answer, but I'd have to store all the rows in memory, compare each row against the stored set, keep the highest, remove the duplicates, and somehow return only the valid rows. That sounds very inefficient; however, there aren't many rows, so I'll consider it (I'm still not sure how to achieve this).
Aggregate Transformation. I understand I can group the rows by ITEM and use the Maximum operation to keep the one with the highest value. This is the approach I'm trying right now, but I'm stuck.
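For reference, in plain SQL what I'm trying to achieve with the Aggregate Transformation would look something like this (inventory is a made-up name for a staging table holding the flat file data):

```sql
-- Group by ITEM and keep the highest QTY_ON_HAND per group.
-- Note this loses LOCATION, since it is neither grouped nor aggregated.
select ITEM, max(QTY_ON_HAND) as QTY_ON_HAND
from inventory
group by ITEM;
```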
Sources:
SSIS: Flat File Source to SQL without Duplicate Rows
Remove duplicate in SSIS package with preference over a column data
Upvotes: 1
Views: 3806
Reputation: 373
Doing this directly from a flat file in SSIS is tricky, I think, because an SSIS Aggregate step will not produce the correct result on its own (it either produces multiple rows or fails to identify the correct location).
However, it can be done in a SQL database without much effort, using a self left join and then filtering for the topmost row. SSIS can be used to import the data into a staging table first and run the query from there.
The SQL is:
select
    i1.item,
    i1.quantity,
    i1.location
from
    item as i1
    left outer join item as i2
        on i1.item = i2.item
        and i1.quantity < i2.quantity
where
    i2.quantity is null
This stage/query approach may be better if you subsequently reuse the data for further processing.
If there is a large amount of data, it would also be worth considering whether indexing the data after load improves performance, particularly if the flat file itself is already sorted, in which case that part of the index build can be skipped.
There are a couple of other ways of deriving this in the db, but this one is fairly efficient and integrates well into more complex queries.
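One caveat with the join above: when two occurrences of an item share the same maximum quantity, both rows survive the filter. If a single row per item is required even on ties, a windowed variant is one option; this is a sketch using the same table and column names as the query above, and it assumes a database that supports ROW_NUMBER (SQL Server 2005+ does):

```sql
-- Rank rows within each item by quantity, highest first,
-- then keep only the top-ranked row; ties are broken arbitrarily.
select item, quantity, location
from (
    select
        item,
        quantity,
        location,
        row_number() over (partition by item
                           order by quantity desc) as rn
    from item
) as ranked
where rn = 1;
```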
Upvotes: 1
Reputation: 4686
The solution is straightforward: just add an Aggregate Transformation.
Group by all the other columns you know are identical for every row in the group, and apply the Maximum operation to the column you want to use to select the row with the highest value.
I'm not grouping by LOCATION, nor by any other column whose values I know will differ between rows of the same group. I don't need LOCATION further down the flow, but if I did, then I suppose I could use more Aggregate Transformations; I don't know.
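If I did need LOCATION, I believe the usual pattern would be to aggregate first and then join the result back to the original rows (in SSIS, perhaps an Aggregate followed by a Merge Join). In SQL it would look roughly like this, with inventory as a hypothetical staging table:

```sql
-- Sketch: find the max quantity per ITEM, then join back to the
-- original rows to recover LOCATION. On ties this returns
-- multiple rows per ITEM, one for each row at the max quantity.
select t.ITEM, t.QTY_ON_HAND, t.LOCATION
from inventory as t
join (
    select ITEM, max(QTY_ON_HAND) as MAX_QTY
    from inventory
    group by ITEM
) as m
    on m.ITEM = t.ITEM
    and m.MAX_QTY = t.QTY_ON_HAND;
```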
Upvotes: 1