sigil

Reputation: 9546

Less expensive way to find duplicate rows in a DataTable?

I want to find all rows in a DataTable that are duplicates of another row across a chosen group of columns. My current idea is to build a list of the indexes of every row that appears more than once, as follows:

public List<int> findDuplicates_New()
{
    string[] duplicateCheckFields = { "Name", "City" };
    List<int> duplicates = new List<int>();
    List<string> rowStrs = new List<string>();
    string rowStr;

    //convert each datarow to a delimited string and add it to list rowStrs
    foreach (DataRow dr in submissionsList.Rows)
    {
        rowStr = string.Empty;
        foreach (DataColumn dc in submissionsList.Columns)
        {
            //only use the duplicateCheckFields in the string
            if (duplicateCheckFields.Contains(dc.ColumnName))
            {
                rowStr += dr[dc].ToString() + "|";
            }
        }
        rowStrs.Add(rowStr);
    }

    //count how many of each row string are in the list
    //add the string's index (which will match the row's index)
    //to the duplicates list if more than 1
    for (int c = 0; c < rowStrs.Count; c++)
    {
        if (rowStrs.Count(str => str == rowStrs[c]) > 1)
        {
            duplicates.Add(c);
        }
    }
    return duplicates;
}

However, this isn't very efficient: going back through the list of strings and counting each one is O(n^2). I looked at this solution but couldn't figure out how to use it with more than one field. I'm looking for a less expensive way to handle this problem.
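To make the cost concrete: the Count() call re-scans the whole list once per row, which is where the O(n^2) comes from. Even keeping the string-key approach, a one-pass tally along these lines (a rough, untested sketch; the helper name is just for illustration) would reduce the counting step to O(n):

    // Sketch (untested): one-pass tally over the already-built key strings.
    // Keeps the string-key idea from above; only the counting step changes.
    private static List<int> IndexesOfDuplicates(List<string> rowStrs)
    {
        var counts = new Dictionary<string, int>();
        foreach (string key in rowStrs)
        {
            if (counts.ContainsKey(key)) counts[key]++;
            else counts[key] = 1;
        }

        var duplicates = new List<int>();
        for (int c = 0; c < rowStrs.Count; c++)
        {
            if (counts[rowStrs[c]] > 1)
            {
                duplicates.Add(c);
            }
        }
        return duplicates;
    }

That still leaves open whether there is a cleaner overall approach than building the delimited strings in the first place.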

Upvotes: 0

Views: 2906

Answers (1)

Sten Petrov

Reputation: 11040

Try this:

How can I check for an exact match in a table where each row has 70+ columns?

The essence is to build a set in which you store a hash for each row, and only do full comparisons between rows whose hashes collide; the overall complexity will be O(n).

...
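A minimal sketch of that idea against the question's table, assuming the same "Name"/"City" key columns (the helper name, and the choice to group by the composite key string rather than a separate hash value, are my own simplifications):

    using System;
    using System.Collections.Generic;
    using System.Data;
    using System.Linq;

    static class DuplicateFinder
    {
        // Returns the indexes of all rows whose key columns appear more than once.
        // One pass to group rows by key, one pass to collect indexes: O(n) overall.
        public static List<int> FindDuplicates(DataTable table, params string[] keyColumns)
        {
            var groups = new Dictionary<string, List<int>>();

            for (int i = 0; i < table.Rows.Count; i++)
            {
                // Build the composite key from the chosen columns only.
                DataRow row = table.Rows[i];
                string key = string.Join("|", keyColumns.Select(c => row[c]?.ToString()));

                if (!groups.TryGetValue(key, out List<int> indexes))
                {
                    indexes = new List<int>();
                    groups[key] = indexes;
                }
                indexes.Add(i);
            }

            // Keep only the indexes whose key was seen more than once.
            return groups.Values.Where(g => g.Count > 1).SelectMany(g => g).ToList();
        }
    }

With the question's table this would be called as DuplicateFinder.FindDuplicates(submissionsList, "Name", "City"). Building the dictionary is a single pass, so the whole thing is O(n) rather than O(n^2).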

If you have a large number of rows and storing the hashes themselves is an issue (an unlikely case, but still...) you can use a Bloom filter. The core idea of a Bloom filter is to compute several different hashes of each row and use them as addresses into a bitmap. As you scan through the rows, you double-check only those rows whose bits in the bitmap have all been set already.
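A rough sketch of that Bloom-filter idea in C# (the class, the bit-array size, and the hash choices below are illustrative placeholders, not tuned values or code from the answer):

    using System;
    using System.Collections;

    // Minimal Bloom filter sketch: k derived hash functions mapped onto one bit array.
    class BloomFilter
    {
        private readonly BitArray bits;
        private readonly int hashCount;

        public BloomFilter(int bitCount, int hashCount)
        {
            bits = new BitArray(bitCount);
            this.hashCount = hashCount;
        }

        // Returns true if the key *might* have been seen before (possible false positive),
        // false if it definitely has not; the key's bits are set either way.
        public bool TestAndAdd(string key)
        {
            uint h1 = (uint)key.GetHashCode();
            uint h2 = Fnv1a(key);
            bool allBitsAlreadySet = true;

            for (int i = 0; i < hashCount; i++)
            {
                // Double hashing: derive the i-th "hash function" from two base hashes.
                int index = (int)((h1 + (uint)i * h2) % (uint)bits.Length);
                if (!bits[index]) allBitsAlreadySet = false;
                bits[index] = true;
            }
            return allBitsAlreadySet;
        }

        // Simple FNV-1a hash so the second base hash is independent of GetHashCode.
        private static uint Fnv1a(string s)
        {
            uint hash = 2166136261;
            foreach (char c in s)
            {
                hash = (hash ^ c) * 16777619;
            }
            return hash;
        }
    }

Usage would be something like new BloomFilter(1 << 20, 3) and calling TestAndAdd with each row's composite key. Any row that comes back true is only a candidate duplicate and still needs an exact comparison against the earlier rows, because a Bloom filter can report false positives.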

Upvotes: 1
