user1254579

Reputation: 4009

How to implement deduplication on a billion-row table in SSIS

Which is the best option to implement a distinct operation in SSIS? I have a table with more than 200 columns that contains more than 10 million rows, and I need to get the distinct rows from it. Is it wise to use an Execute SQL Task (with a SELECT query to deduplicate the rows), or is there another way to achieve this in SSIS?

I understand that the SSIS Sort component can deduplicate rows, but since it is a fully blocking component it does not seem like a good idea to use here. Please let me know your views on this.

Upvotes: 0

Views: 2310

Answers (3)

Irawan Soetomo

Reputation: 1325

I have done it in 3 steps, this way:

  1. Dump the MillionRow table into a HashDump table, which has only 2 columns: Id int identity PK, and Hash varbinary(20). This table is indexed on its Hash column (a sketch of both staging tables follows this list).
  2. Dump the HashDump table into HashUni, ordered by the Hash column. In between is a Script Component that checks whether the current row's Hash value is the same as the previous row's. If it is, direct the row to the Duplicate output, otherwise to Unique. This way you can log the duplicates even if all you need is the uniques.
  3. Dump the MillionRow table into a MillionUni table. In between is a Lookup component that uses HashUni to tell which rows are unique.
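For reference, the two staging tables could look like this (a sketch; the identity PK, the Hash varbinary(20) column, and the index are from the steps above, the rest is assumed):

CREATE TABLE dbo.HashDump (
    Id   int IDENTITY(1,1) PRIMARY KEY,  -- row number assigned during the dump
    Hash varbinary(20) NOT NULL          -- SHA-1 of the concatenated column values
);
CREATE INDEX IX_HashDump_Hash ON dbo.HashDump (Hash);

-- Receives the Unique output of step 2; shape assumed to mirror HashDump.
CREATE TABLE dbo.HashUni (
    Id   int NOT NULL PRIMARY KEY,
    Hash varbinary(20) NOT NULL
);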

This method allows me to log each duplicate with a message such as: "Row 1000 is a duplicate of row 100".

I have not found a better way than this. Earlier, I put a unique index on MillionUni so I could dump MillionRow directly into it, but without "fast load" it was way too slow.

Here is one way to populate the Hash column:

// Needs these namespaces at the top of the Script Component:
//   using System.Security.Cryptography;
//   using System.Text;

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Concatenate the column values with a separator so that, e.g.,
    // ("ab", "c") and ("a", "bc") do not hash to the same value.
    StringBuilder sb = new StringBuilder();
    sb.Append(Row.Col1String_IsNull ? "" : Row.Col1String); sb.Append("|");
    sb.Append(Row.Col2Num_IsNull ? "" : Row.Col2Num.ToString()); sb.Append("|");
    sb.Append(Row.Col3Date_IsNull ? "" : Row.Col3Date.ToString("yyyy-MM-dd"));

    // SHA-1 produces 20 bytes, matching the varbinary(20) Hash column.
    var sha1Provider = HashAlgorithm.Create("SHA1");
    Row.Hash = sha1Provider.ComputeHash(Encoding.UTF8.GetBytes(sb.ToString()));
}

If 200 columns prove to be a chore for you, part of this article may inspire you: it loops over all the column objects and concatenates their values into a single string.
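Alternatively, if your source is SQL Server, the hash can be computed in the source query instead of a Script Component. A minimal sketch (my own suggestion, not part of the steps above; column names are the ones from the snippet):

-- Sketch: let the engine build the hash; assumes SQL Server 2012+ for CONCAT.
-- Caveat: before SQL Server 2016, HASHBYTES input is capped at 8000 bytes,
-- which 200 wide columns may exceed; the Script Component has no such limit.
SELECT HASHBYTES('SHA1',
           CONCAT(Col1String, '|', Col2Num, '|',
                  CONVERT(char(10), Col3Date, 120))  -- yyyy-mm-dd, as in the script
       ) AS Hash
FROM dbo.MillionRow;

CONCAT also treats NULLs as empty strings, matching the _IsNull handling in the script above.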

And to compare the Hash, use this method:

// Requires: using System.Collections; at the top of the Script Component.
// The input must arrive ordered by Hash for this row-to-row comparison to work.
byte[] previousHash;
int previousRowNo;

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Compare the byte arrays by content, not by reference; on the first row
    // previousHash is null, so the row falls through to the Unique output.
    if (StructuralComparisons.StructuralEqualityComparer.Equals(Row.Hash, previousHash))
    {
        Row.DupRowNo = previousRowNo;  // the first row seen with this Hash
        Row.DirectRowToDuplicate();
    }
    else
    {
        Row.DirectRowToUnique();
        previousRowNo = Row.RowNo;     // remember the group's first (unique) row
    }
    previousHash = Row.Hash;
}

Upvotes: 1

DNac

Reputation: 2783

As far as I know, the Sort component is the only transformation that lets you filter out duplicates. Otherwise you can use a SQL command: if the sorting operation is the problem, then (assuming your source is a database) choose "SQL command" as the source's Data Access Mode and SELECT DISTINCT your data there. You may also save a bit of time, as the ETL won't have to go through the Sort component.
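For example, something like this as the source's SQL command (a sketch; the table name is assumed):

-- Entered as the "SQL command" of the OLE DB Source, so the database engine
-- does the deduplication and the data flow never needs a blocking Sort.
SELECT DISTINCT *
FROM dbo.MillionRow;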


Upvotes: 0

Brij Raj Singh - MSFT

Reputation: 5113

I wouldn't bother with SSIS for this; a couple of queries will do. Also, since you have a lot of data, I suggest you check the execution plan before running the queries and optimize your indexes.

Check out a small article I wrote on the same topic:

http://www.brijrajsingh.com/2011/03/delete-duplicate-record-but-keep.html
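A common pattern along those lines (a sketch, not necessarily the article's exact query; the column list is an assumption and should name every column that defines a duplicate):

-- Sketch: number the rows within each duplicate group, keep one, delete the rest.
WITH Numbered AS (
    SELECT ROW_NUMBER() OVER (
               PARTITION BY Col1, Col2, Col3  -- extend to all 200 columns
               ORDER BY (SELECT 0)            -- arbitrary survivor within a group
           ) AS rn
    FROM dbo.MillionRow
)
DELETE FROM Numbered
WHERE rn > 1;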

Upvotes: 0
