J.Brd
J.Brd

Reputation: 11

Data cleansing with a knowledge base in SSIS without DQS

For a school assignment we have to design and fill a data warehouse for a fictional shipping company. I made a PSA database with all the "dirty" data, and am now in the process of designing the ETL process. The whole process needs to be made in SSIS and besides some basic tutorials I have very limited knowledge of this program.

I am stuck on a process of cleansing a column in a table where some countries are written incorrectly. I made a reference table with in one column the grammatical errors and in the other column the correct format of the country name. Here you can see a snippet of the knowledge base table:

enter image description here

and here you can see a snippet of the table with the "dirty" data:

enter image description here

How can I search and replace the faulty data in the "Country" column with the correct data from the knowledge base table in the SSIS data flow? I don't think I can use the fuzzy lookup because the faulty data contains abbreviations of country names.

We can't use Microsoft DQS because that would make things too easy for us I guess...

Thanks in advance!

Upvotes: 0

Views: 360

Answers (2)

EzLo
EzLo

Reputation: 14199

First make sure that the Error column is unique. You might want to remove case sensitive variations and declare the column with a collate that is case insensitive, so you don't have to add all variations of capital letters for each possible error.

-- "CI" means case insensitive, "AI" means ascent insensitive
ALTER TABLE CountryVariations ALTER COLUMN Error VARCHAR(100) COLLATE Modern_Spanish_CI_AI 

Remove already inserted duplicates:

;WITH Duplicates AS
(
    SELECT
        C.Error,
        C.Country,
        RepeatRanking = ROW_NUMBER() OVER (PARTITION BY Error ORDER BY Error)
    FROM
        CountryVariations AS C
)
DELETE D FROM
    Duplicates AS D
WHERE
    D.RepeatRanking > 1

Then use a classic LookUp component in your SSIS package, joining your source's Country value with the Error column in the CountryVariations table, and return the corrected standard Country column. You will have to consider what to do when you don't have a matching variation in your table (redirect to error flow, fail package, assign a default country, etc.).

Upvotes: 0

Tab Alleman
Tab Alleman

Reputation: 31785

Sounds to me like you're overthinking this. Use a standard lookup and get the correction from the knowledge base table, if one exists, and use the correction if there is one, or the original country name if there isn't. There's no fuzzy lookup, there's no LIKE operator, just an equality check.

Upvotes: 2

Related Questions