Reputation: 11
For a school assignment we have to design and fill a data warehouse for a fictional shipping company. I made a PSA database with all the "dirty" data, and am now in the process of designing the ETL process. The whole process needs to be made in SSIS and besides some basic tutorials I have very limited knowledge of this program.
I am stuck on a process of cleansing a column in a table where some countries are written incorrectly. I made a reference table with in one column the grammatical errors and in the other column the correct format of the country name. Here you can see a snippet of the knowledge base table:
and here you can see a snippet of the table with the "dirty" data:
How can I search and replace the faulty data in the "Country" column with the correct data from the knowledge base table in the SSIS data flow? I don't think I can use the fuzzy lookup because the faulty data contains abbreviations of country names.
We can't use Microsoft DQS because that would make things too easy for us I guess...
Thanks in advance!
Upvotes: 0
Views: 360
Reputation: 14199
First make sure that the Error
column is unique. You might want to remove case sensitive variations and declare the column with a collate that is case insensitive, so you don't have to add all variations of capital letters for each possible error.
-- "CI" means case insensitive, "AI" means ascent insensitive
ALTER TABLE CountryVariations ALTER COLUMN Error VARCHAR(100) COLLATE Modern_Spanish_CI_AI
Remove already inserted duplicates:
;WITH Duplicates AS
(
SELECT
C.Error,
C.Country,
RepeatRanking = ROW_NUMBER() OVER (PARTITION BY Error ORDER BY Error)
FROM
CountryVariations AS C
)
DELETE D FROM
Duplicates AS D
WHERE
D.RepeatRanking > 1
Then use a classic LookUp component in your SSIS package, joining your source's Country
value with the Error
column in the CountryVariations
table, and return the corrected standard Country
column. You will have to consider what to do when you don't have a matching variation in your table (redirect to error flow, fail package, assign a default country, etc.).
Upvotes: 0
Reputation: 31785
Sounds to me like you're overthinking this. Use a standard lookup and get the correction from the knowledge base table, if one exists, and use the correction if there is one, or the original country name if there isn't. There's no fuzzy lookup, there's no LIKE operator, just an equality check.
Upvotes: 2