How do I merge multiple rows with partial duplicated data into one row but keep non-duplicated data?

Question

I got a big datatable (over 300K rows with 40 columns in it) with data, the fragments like these (all value is string) :

colA colB colC ColD ColdE ColF ColG ColH
--------------------------------------------------------
A01 B01 C01 DA1 EA1 FA1 GA1 HA1
A01 B01 C01 DA2 EA2 FA2 GA2 HA2
A02 B02 C02 DA3 EA3 FA3 GA3 HA3
A02 B02 C02 DA4 EA4 FA4 GA4 HA4
A03 B03 C03 DA5 EA5 FA5 GA5 HA5
A04 B04 C04 DA6 EA6 FA6 GA6 HA6

There is partial data duplicated and I would like merge duplicated data by using colA + ColB+ ColC for key and keep ColD ColE ColF, use first row for other columns. The expected result like these:

colA colB colC ColD1 colE1 colF1 colG1 ColD2 colE2 colF2 colG2 ColH
------------------------------------------------
A01 B01 C01 DA1 EA1 FA1 GA1 DA2 EA2 FA2 GA2 HA1
A02 B02 C02 DA3 EA3 FA3 GA3 DA4 EA4 FA4 GA4 HA3
A03 B03 C03 DA5 EA5 FA5 GA5 null null null null HA5
A04 B04 C04 DA6 EA6 FA6 GA6 null null null null HA6

It is something like pivot but have some difference, I tried to use T-SQL or LINQ with C# but have no idea to do this, please someone help, many thanks.

Dan Dohotaru · Accepted Answer

sounds like a job for the ExpandoObject

relying on the input records you've provided

var input = new DataTable();
input.Columns.Add("ColA");
input.Columns.Add("ColB");
input.Columns.Add("ColC");
input.Columns.Add("ColD");
input.Rows.Add("A01", "B01", "CA1", "DA1");
input.Rows.Add("A01", "B01", "CA2", "DA2");
input.Rows.Add("A02", "B02", "CA3", "DA3");
input.Rows.Add("A02", "B02", "CA4", "DA4");
input.Rows.Add("A03", "B03", "CA5", "DA5");
input.Rows.Add("A04", "B04", "CA6", "DA6");

you could convert a record into a dynamic expandable object

public IDictionary Map(DataRow row)
{
    var columns = row.Table.Columns;
    var result = new ExpandoObject() as IDictionary;
    for (var index = 0; index < row.ItemArray.Count(); index++)
    {
        result.Add($"{columns[index]}", row[index]);
    }
    return result;
}

and then a bit of logic meant to group input by marker element and expand where needed

var seed = new List>();
var output = input
    .AsEnumerable()
    .Select(Map)
    .Aggregate(seed, (results, current)=>
    {
        // Check if the current values match any of the first element in the results
        var query = from result in results
                    let marker = result
                        .Select(p => p.Value)
                        .FirstOrDefault()
                    where current.Values.Contains(marker)
                    select result;

        var found = query.SingleOrDefault();
        if (found == null)
        {
            // None were found then simply append the current values
            results.Add(current);
        }
        else
        {
            // Some were found then isolate the new ones
            var others = from value in current.Values
                         where !found.Values.Contains(value)
                         select value;

            // Append the new ones to the found result
            foreach (var value in others)
            {
                var index = found.Values.Count;
                found.Add($"Col{index}".ToString(), value);
            }
        }

        return results;
    });

and the end result will look like this

check gist for the whole picture

How do I merge multiple rows with partial duplicated data into one row but keep non-duplicated data?

Answers (2)

Related Questions