RLH

Reputation: 15698

How to create a Primary Key on quasi-unique data keys?

I have a nightly SSIS process that exports a TON of data from an AS400 database system. Due to bugs in the AS400 DB software, occasional duplicate keys are inserted into data tables. Every time a new duplicate is added to an AS400 table, it kills my nightly export process. This issue has moved from being a nuisance to a problem.

What I need is an option to insert only unique data. If there are duplicates, select the first encountered row of the duplicate rows. Is there SQL syntax available that could help me do this? I know of the DISTINCT clause, but that doesn't work in my case because, for most of the offending records, the data is duplicated in its entirety except for the fields that comprise the PK.

In my case, it is more important for my primary keys to remain unique in my SQL Server DB cache, rather than having a full snapshot of data. Is there something I can do to force this constraint on the export in SSIS/SQL Server with out crashing the process?

EDIT

Let me further clarify my request. What I need is to assure that the data in my exported SQL Server tables maintains the same keys that are maintained in the AS400 data tables. In other words, creating a unique row-count identifier wouldn't work, nor would inserting all of the data without a primary key.

If a bug in the AS400 software allows for mistaken, duplicate PKs, I want to either ignore those rows or, preferably, just select one of the rows with the duplicate key but not both of them.

This filtering should probably happen in the SELECT statement of my SSIS project, which connects to the mainframe through an ODBC connection.

I suspect that there may not be a "simple" solution to my problem. I'm hoping, however, that I'm wrong.

Upvotes: 2

Views: 939

Answers (4)

HLGEM

Reputation: 96570

If I understand you correctly, you have duplicated PKs that have different data in the other fields.

First, put the data from the other database into a staging table. I find it easier to research issues with imports (especially large ones) if I do this. Actually I use two staging tables (and for this case I strongly recommend it), one with the raw data and one with only the data I intend to import into my system.

Now you can use an Execute SQL Task to grab one of the records for each key (see @Quassnoi's answer for an idea of how to do that; you may need to adjust his query for your situation). Personally, I put an identity column into my staging table so I can identify which is the first or last occurrence of duplicated data. Then put the record you chose for each key into your second staging table. If you are using an exception table, copy the records you are not moving to it, and don't forget a reason code for the exception ("Duplicated key", for instance).
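The two staging tables and the exception copy might look like this (a sketch only; every table and column name here is a placeholder, not from the original post):

```sql
-- Raw staging table: identity column lets you tell the first
-- occurrence of a duplicated key from the later ones.
CREATE TABLE staging_raw
(
    RowID   INT IDENTITY(1,1) PRIMARY KEY,
    PKID    INT NOT NULL,          -- the quasi-unique key from the AS400
    Value1  VARCHAR(50),
    Value2  VARCHAR(50)
);

-- Exception table with a reason code, as described above.
CREATE TABLE staging_exceptions
(
    RowID      INT,
    PKID       INT,
    Value1     VARCHAR(50),
    Value2     VARCHAR(50),
    ReasonCode VARCHAR(30)
);

-- Copy the rows you are NOT keeping (everything but the first
-- occurrence of each key) into the exception table.
INSERT INTO staging_exceptions (RowID, PKID, Value1, Value2, ReasonCode)
SELECT  r.RowID, r.PKID, r.Value1, r.Value2, 'Duplicated key'
FROM    staging_raw r
WHERE   r.RowID <> (SELECT MIN(m.RowID) FROM staging_raw m WHERE m.PKID = r.PKID);
```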

Now that you have only one record per key in a staging table, your next task is to decide what to do about the other data that is not unique. If there are two different business addresses for the same customer, which do you choose? This is a matter of business-rules definition, not strictly speaking SSIS or SQL code. You must define the business rules for how to choose the data when it needs to be merged between two records (what you are doing is the equivalent of a de-duping process). If you are lucky, there is a date field or some other way to determine which is the newest or oldest data, and that is the data they want you to use. In that case, once you have selected just one record, you are done with the initial transform.

More than likely, though, you may need a different rule for each field to choose the correct value. In that case you write SSIS transforms in a data flow or Execute SQL Tasks to pick the correct data and update the staging table.
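For instance, if the business rule is "keep the most recently updated row", an Execute SQL Task could rank the duplicates by a timestamp, along these lines (a sketch; staging_raw, staging_clean, PKID, RowID, and last_updated are all hypothetical names):

```sql
-- Rank each key's duplicates newest-first, then keep only rank 1.
;WITH ranked AS
(
    SELECT  RowID,
            ROW_NUMBER() OVER (PARTITION BY PKID
                               ORDER BY last_updated DESC) AS rn
    FROM    staging_raw
)
INSERT INTO staging_clean (PKID, Value1, Value2)
SELECT  s.PKID, s.Value1, s.Value2
FROM    staging_raw s
        INNER JOIN ranked r ON s.RowID = r.RowID
WHERE   r.rn = 1;
```

Per-field rules (newest address, longest phone number, etc.) would replace the single ORDER BY with one UPDATE or transform per field.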

Once you have the exact record you want to import, do the data flow to move it to the correct production tables.

Upvotes: 0

user756519

Reputation:

Since you are using SSIS, you are presumably using an OLE DB Source to fetch the data from the AS400 and an OLE DB Destination to insert the data into SQL Server.

Let's assume that you don't have any transformations

Data Flow Task

Add a Sort transformation after the OLE DB Source. In the Sort transformation, there is a check box at the bottom to remove duplicate rows based on a given set of column values. Check all the fields, but don't select the primary key that comes from the AS400. This will eliminate the duplicate rows but will still insert the data that you need.

(screenshot: Sort transformation editor)

I hope that is what you are looking for.

Upvotes: 2

Code Magician

Reputation: 23972

There are several options.

If you use the IGNORE_DUP_KEY option (http://www.sqlservernation.com/home/creating-indexes-with-ignore_dup_key.html) on your primary key, SQL Server will issue a warning and only the duplicate records will fail to insert.
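For reference, the option is set when the key (or unique index) is created, e.g. (table and column names are placeholders):

```sql
-- With IGNORE_DUP_KEY = ON, inserting a duplicate key raises the
-- warning "Duplicate key was ignored." and skips only that row,
-- instead of failing the whole insert.
ALTER TABLE dbo.ExportCache
ADD CONSTRAINT PK_ExportCache
    PRIMARY KEY CLUSTERED (PKID)
    WITH (IGNORE_DUP_KEY = ON);
```

Note that which of the duplicate rows survives is simply whichever arrives first, which matches the "select the first encountered row" requirement.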

You can also group/roll up your data, but this can get very expensive. What I mean by that is:

SELECT Id, MAX(value1), MAX(value2), MAX(value3), ...
FROM staging_table
GROUP BY Id

Another option is to add an identity column (and cluster on this for an efficient join later) to your staging table and then create a mapping in a temp table. The mapping table would be:

CREATE TABLE #mapping
(
    RowID INT PRIMARY KEY CLUSTERED,
    PKID INT
)

INSERT INTO #mapping (RowID, PKID)
SELECT MIN(RowID), PKID
FROM staging_table
GROUP BY PKID

INSERT INTO presentation_table
SELECT S.* 
FROM Staging_table S 
    INNER JOIN #mapping M 
        ON S.RowID = M.RowID 

Upvotes: 1

Quassnoi

Reputation: 425411

In SQL Server 2005 and above:

SELECT  *
FROM    (
        SELECT  *,
                ROW_NUMBER() OVER (PARTITION BY almost_unique_field ORDER BY id) rn
        FROM    import_table
        ) q
WHERE   rn = 1
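One caveat: the outer SELECT * also returns the helper rn column. If this query feeds the SSIS source directly, list the real columns in the outer SELECT instead (column names below are placeholders):

```sql
SELECT  q.id, q.almost_unique_field
FROM    (
        SELECT  *,
                ROW_NUMBER() OVER (PARTITION BY almost_unique_field ORDER BY id) rn
        FROM    import_table
        ) q
WHERE   q.rn = 1
```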

Upvotes: 1
