Reputation: 4078
I have the following implementation to find duplicates in a DataTable . It is highly inefficient and takes forever on about 20K rows. I only have to find duplicate entries for the second column values:
private List<string> checkForDuplicates(DataTable results)
{
List<string> duplicateLists = new List<string>();
for (int i = 0; i < results.Rows.Count; i++ )
{
string cellvalue = results.Rows[i][1].ToString();
for (int j = 0; j < results.Rows.Count; j++)
{
if (i != j)
{
if (cellvalue.Equals(results.Rows[j][1]))
{
//Duplicate found
duplicateLists.Add(results.Rows[i][1].ToString() + "_" + i+2 + "_" + j+2);
}
}
}
}
return duplicateLists;
}
Upvotes: 1
Views: 2363
Reputation: 1032
Use a Dictionary and iterate once over all values and count the occurence of each value => Dictionary key is the column value, Dictionary value is the count. Then return all keys where count is more than one.
Upvotes: 1
Reputation: 789
SQL would be much more efficient way of doing this rather than pulling the entire dataset twice.
You can do it very quickly if you have an index on the column you are referring to.
Just do
SELECT id AS matchID, column1 FROM table1 WHERE column1 IN (SELECT column1 FROM table1 WHERE id IS NOT matchId)
or something like that
Cheers, Niko
Upvotes: 0
Reputation: 41579
The problem you've got is that every row has to check every other row, so with more rows the number of checks goes up exponentially. The quickest way to handle it is to make it linear - only do as many checks as there are rows.
One way to do this is to sort the data table by column2. This will put any duplicates in adjacent rows, so then you just need to run through the table checking that one row doesn't match the next one.
The other way is to get things at source and make sure the rows are distinct before you read them.
Upvotes: 1
Reputation: 12630
One optimisation you could make is to do the de-duplication on a sorted data set. Define a DataView which sorts the data on the relevant column, then simply check that the current row's value is not the same as the previous row's value.
Mark Sowul's answer might be a better idea if you aren't bothered about physically removing the rows however.
Upvotes: 1
Reputation: 10610
DataTable distinctTable = originalTable.DefaultView.ToTable( /*distinct*/ true);
For your purposes you could make a DataView that includes only the column(s) you're interested in.
Upvotes: 0