n4rzul

Reputation: 4078

remove duplicates in a datatable

I have the following implementation for finding duplicates in a DataTable. It is highly inefficient and takes forever on about 20K rows. I only need to find duplicate entries in the second column's values:

private List<string> checkForDuplicates(DataTable results)
{
    List<string> duplicateLists = new List<string>();
    for (int i = 0; i < results.Rows.Count; i++)
    {
        string cellvalue = results.Rows[i][1].ToString();
        for (int j = 0; j < results.Rows.Count; j++)
        {
            if (i != j)
            {
                if (cellvalue.Equals(results.Rows[j][1].ToString()))
                {
                    // Duplicate found: record the value plus both row numbers
                    // (offset by 2 to match the spreadsheet-style row numbering)
                    duplicateLists.Add(results.Rows[i][1].ToString() + "_" + (i + 2) + "_" + (j + 2));
                }
            }
        }
    }
    return duplicateLists;
}

Upvotes: 1

Views: 2363

Answers (5)

Matthias

Reputation: 1032

Use a Dictionary and iterate once over all rows, counting the occurrence of each value: the dictionary key is the column value and the dictionary value is the count. Then return all keys whose count is greater than one.
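
A minimal sketch of that idea, assuming (as in the question) that the values to compare live in the second column and that only the duplicated values themselves need to be returned:

private List<string> FindDuplicateValues(DataTable results)
{
    // Single pass: count how often each value of the second column occurs.
    Dictionary<string, int> counts = new Dictionary<string, int>();
    foreach (DataRow row in results.Rows)
    {
        string value = row[1].ToString();
        if (counts.ContainsKey(value))
            counts[value]++;
        else
            counts[value] = 1;
    }

    // Every key seen more than once is a duplicated value.
    List<string> duplicates = new List<string>();
    foreach (KeyValuePair<string, int> pair in counts)
    {
        if (pair.Value > 1)
            duplicates.Add(pair.Key);
    }
    return duplicates;
}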

Upvotes: 1

NikoRoberts

Reputation: 789

SQL would be a much more efficient way of doing this than pulling the entire dataset into memory and scanning it twice.

You can do it very quickly if you have an index on the column you are referring to.

Just do

SELECT id, column1
FROM table1
WHERE column1 IN (
    SELECT column1 FROM table1 GROUP BY column1 HAVING COUNT(*) > 1
)

or something like that

Cheers, Niko

Upvotes: 0

Jon Egerton

Reputation: 41579

The problem you've got is that every row is checked against every other row, so the number of checks grows quadratically with the number of rows (roughly 400 million comparisons for 20K rows). The quickest way to handle it is to cut that down to about one check per row.

One way to do this is to sort the data table by column2. That puts any duplicates in adjacent rows, so you only need to run through the table once, checking whether each row matches the next one (see the sketch below).

The other way is to fix it at source and make sure the rows are already distinct before you read them.
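
A rough sketch of the sort-then-scan idea; as a variation it sorts a copy of the second column's values rather than the table itself, and assumes only the duplicated values (not their row numbers) are wanted:

private List<string> FindDuplicatesBySorting(DataTable results)
{
    // Copy the second column's values out and sort them,
    // so duplicates end up next to each other.
    List<string> values = new List<string>(results.Rows.Count);
    foreach (DataRow row in results.Rows)
    {
        values.Add(row[1].ToString());
    }
    values.Sort();

    List<string> duplicates = new List<string>();
    for (int i = 1; i < values.Count; i++)
    {
        // A value equal to its predecessor is a duplicate; the second check
        // keeps each duplicated value from being reported more than once.
        if (values[i] == values[i - 1] &&
            (duplicates.Count == 0 || duplicates[duplicates.Count - 1] != values[i]))
        {
            duplicates.Add(values[i]);
        }
    }
    return duplicates;
}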

Upvotes: 1

mdm
mdm

Reputation: 12630

One optimisation you could make is to do the de-duplication on a sorted data set. Define a DataView which sorts the data on the relevant column, then simply check that the current row's value is not the same as the previous row's value.
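
A hedged sketch of that approach; "Column2" is only a placeholder for the real name of the column being de-duplicated:

private List<string> FindDuplicatesWithDataView(DataTable results)
{
    // Sort a view of the table on the relevant column so duplicates are adjacent.
    DataView view = new DataView(results);
    view.Sort = "Column2 ASC";

    List<string> duplicates = new List<string>();
    string previous = null;
    foreach (DataRowView rowView in view)
    {
        string current = rowView["Column2"].ToString();
        // Compare each sorted row against the previous one;
        // each extra occurrence of a value adds one entry to the list.
        if (previous != null && current == previous)
        {
            duplicates.Add(current);
        }
        previous = current;
    }
    return duplicates;
}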

However, if you don't need to physically remove the rows from the original table, Mark Sowul's answer might be a better option.

Upvotes: 1

Mark Sowul
Mark Sowul

Reputation: 10610

From: http://social.msdn.microsoft.com/Forums/en/adodotnetdataset/thread/ed9c6a6a-a93e-4bf5-a892-d8471b84aa3b

DataTable distinctTable = originalTable.DefaultView.ToTable( /*distinct*/ true);

For your purposes you could make a DataView that includes only the column(s) you're interested in.
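
For example (the column name "Column2" here is just an assumption for whichever column holds the values being de-duplicated):

DataView view = originalTable.DefaultView;
DataTable distinctValues = view.ToTable(/*distinct*/ true, "Column2");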

Upvotes: 0
