C# - Looking for the list of duplicated rows (need optimization)

Question

Please, I would like to optimize this code in C#, if possible.

When there are less than 1000 lines, it's fine. But when we have at least 10000, it starts to take some time... Here a little benchmark :

5000 lines => ~2s
15000 lines => ~20s
25000 lines => ~50s

Indeed, I'm looking for duplicated lines.

Method SequenceEqual to check values may be a problem (in my "benchmark", I have 4 fields considered as "keyField" ...).

Here is the code :

private List GetDuplicateKeys(DataTable table, List keyFields)
{
    Dictionary, int> keys = new Dictionary, int>(); // List of key values + their index in table
    List> duplicatedKeys = new List>(); // List of duplicated keys values 

    List duplicatedRows = new List(); // Rows that are duplicated

    foreach (DataRow row in table.Rows)
    {
        // Find keys fields values for the row
        List

t3chb0t · Accepted Answer

I think this is the fastest and shortest way to find duplicate rows:

For 100.000 rows it executes in about 250ms.

Main and test data:

static void Main(string[] args)
{
    var dt = new DataTable();
    dt.Columns.Add("Id");
    dt.Columns.Add("Value1");
    dt.Columns.Add("Value2");

    var rnd = new Random(DateTime.Now.Millisecond);
    for (int i = 0; i < 100000; i++)
    {
        var dr = dt.NewRow();
        dr[0] = rnd.Next(1, 1000);
        dr[1] = rnd.Next(1, 1000);
        dr[2] = rnd.Next(1, 1000);
        dt.Rows.Add(dr);
    }

    Stopwatch sw = new Stopwatch();
    sw.Start();
    var duplicates = GetDuplicateRows(dt, "Id", "Value1", "Value2");
    sw.Stop();
    Console.WriteLine(
        "Found {0} duplicates in {1} miliseconds.", 
        duplicates.Count,
        sw.ElapsedMilliseconds);        
    Console.ReadKey();
}

GetDuplicateRows with LINQ:

private static List GetDuplicateRows(DataTable table, params string[] keys)
{
    var duplicates =
        table
        .AsEnumerable()
        .GroupBy(dr => String.Join("-", keys.Select(k => dr[k])), (groupKey, groupRows) => new { Key = groupKey, Rows = groupRows })
        .Where(g => g.Rows.Count() > 1)
        .SelectMany(g => g.Rows)
        .ToList();

    return duplicates;
}

Explanation (for those who are new to LINQ):

The most tricky part is the GroupBy I guess. Here I take as the first parameter a DataRow and for each row I create a group key from the values for the specified keys that I join to create a string like 1-1-2. Then the second parameter just selects the group key and the group rows into a new anonymous object. Then I check if there is more then 1 row and flatten the groups back into a list with SelectMany.

C# - Looking for the list of duplicated rows (need optimization)

Answers (2)

Related Questions