JohnHk

Reputation: 85

C# Efficient de-duplication of single Datatable column's data

I have a DataTable with some data (sample below) and need to de-duplicate the values in the Name column by appending [1], [2], etc. to any name that occurs more than once.

My current code (below) works but is slow on large tables. Any tips on the most efficient way of doing this in C# would be appreciated.

Current Table sample:

- ID	Name		X	Y
- 1	John		45	66
- 2	Paul		44	66
- 3	George		88	102
- 4	John		33	90
- 5	John		53	37
- 6	Paul		97	65
- 7	Ringo		01	87
- 8	Ringo		76	65

Required Table sample:

- ID	Name		X	Y
- 1	John[1]		45	66
- 2	Paul[1]		44	66
- 3	George		88	102
- 4	John[2]		33	90
- 5	John[3]		53	37
- 6	Paul[2]		97	65
- 7	Ringo[1]	01	87
- 8	Ringo[2]	76	65

Current code below:

foreach (DataRow aRow in ds.Tables[0].Rows) // run through all rows
{
    string aName = aRow["Name"].ToString();
    DataRow[] FoundRows = ds.Tables[0].Select("Name = '" + aName + "'"); // Find all rows with same name

    if (FoundRows.Length > 1) // As will always find itself
    {
        int i = 1;
        foreach (DataRow row in FoundRows)
        {
            row["Name"] = row["Name"].ToString() + "[" + i + "]";
            i++;
        }
        ds.Tables[0].AcceptChanges(); // Ensure the rows are updated before looping around.
    }
}

Upvotes: 0

Views: 65

Answers (2)

Dmitry Egorov

Reputation: 9650

A plain old for loop that updates the whole table in a single pass is probably the fastest approach:

var foundNames = new Dictionary<string, int>();
for (int rowInd = 0; rowInd < dataTable.Rows.Count; rowInd++)
{
    // If name is not yet found in foundNames, then store its row
    // index. Don't update the dataTable yet -- this is the only
    // occurrence so far.
    // The index is stored inverted to distinguish from count.
    //
    // If name is found in foundNames, retrieve the count.
    // If count is inverted (non-positive), then we've encountered
    // the name a second time. In this case update the row with the
    // first occurrence and the current row too. Store the count of 2.
    //
    // If count is positive, then it's the third or a later occurrence.
    // Update the current row only and store the incremented count.

    var name = dataTable.Rows[rowInd]["Name"].ToString();
    int count;
    if (!foundNames.TryGetValue(name, out count))
        foundNames.Add(name, -rowInd);
    else
    {
        if (count <= 0)
        {
            dataTable.Rows[-count]["Name"] = name + "[1]";
            count = 1;
        }
        count++;
        dataTable.Rows[rowInd]["Name"] = name + "[" + count + "]";
        foundNames[name] = count;
    }
}
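
For comparison, here is a sketch of a two-pass variant. It is not the code above but a different arrangement of the same dictionary idea: count every name first, then number only the names that occur more than once. That costs a second pass over the rows but avoids the inverted-index trick. The class and method names (`NameDeduplicator`, `AppendDuplicateSuffixes`) are made up for illustration, and the sketch assumes names are compared as exact, case-sensitive strings.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class NameDeduplicator
{
    // Hypothetical helper: appends [1], [2], ... to every value that occurs
    // more than once in the given column. Unique values are left untouched.
    public static void AppendDuplicateSuffixes(DataTable table, string columnName)
    {
        // Pass 1: count how often each name occurs.
        var totals = new Dictionary<string, int>();
        foreach (DataRow row in table.Rows)
        {
            string name = row[columnName].ToString();
            int seen;
            totals.TryGetValue(name, out seen); // seen stays 0 if the name is new
            totals[name] = seen + 1;
        }

        // Pass 2: number the rows whose name occurs at least twice.
        var nextIndex = new Dictionary<string, int>();
        foreach (DataRow row in table.Rows)
        {
            string name = row[columnName].ToString();
            if (totals[name] < 2)
                continue; // unique name, no suffix needed

            int index;
            nextIndex.TryGetValue(name, out index);
            index++;
            nextIndex[name] = index;
            row[columnName] = name + "[" + index + "]";
        }
    }
}
```

Calling `NameDeduplicator.AppendDuplicateSuffixes(ds.Tables[0], "Name")` on the question's sample should leave George unchanged and number the John, Paul and Ringo rows in table order. Both this and the one-pass loop do a constant amount of dictionary work per row, whereas the original code runs one `Select` per row, which is likely where the time goes on large tables.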

Upvotes: 0

Nino

Reputation: 7115

Here is one approach, using LINQ to group the rows by name:

DataTable table = new DataTable();
//test data
table.Columns.Add("Name");
table.Columns.Add("X", typeof(int));
table.Rows.Add(new object[] { "john", 10 });
table.Rows.Add(new object[] { "paul", 44 });
table.Rows.Add(new object[] { "ringo", 312 });
table.Rows.Add(new object[] { "george", 30 });
table.Rows.Add(new object[] { "john", 100 });
table.Rows.Add(new object[] { "paul", 443 });

//needs "using System.Data;" and "using System.Linq;" (AsEnumerable is an extension method on DataTable)
//convert the DataTable to an enumerable collection of rows, then group by name,
//skipping groups with only one row (such as george or ringo)
var groupedData = table.AsEnumerable().GroupBy(row => row[0].ToString()).Where(g => g.Count() > 1);

//iterate through each group of <string, DataRow>
foreach (var group in groupedData)
{
    int counter = 1; //counter for "[x]" suffix
    //iterate through all rows under one name, eg. John
    foreach (var groupedItem in group)
    {
        //append [x] with no space, to match the required output (e.g. john[1])
        groupedItem[0] = string.Format("{0}[{1}]", group.Key, counter);
        counter++;
    }
}

EDIT: simplified code and made it a bit more efficient, as suggested by AdrianWragg

Upvotes: 1
