Reputation: 2828

Comparison of two large datatables is slow

I have two datatables A and B with approx. 50000 rows in each. I am comparing some of the columns (not complete raw) of table B to the datatable A, using two foreach loops. The comparision takes 2~3min to complete. Is there a way to improve performance?

    foreach (DataRow entry in B.Rows)
    {
        // Compare results
        foreach (DataRow dr in A.Rows)
        {
            if (entry[1].ToString().ToUpper()== dr[2].ToString().ToUpper() && entry[2].ToString().ToUpper()== dr[3].ToString().ToUpper())
            {
               // do stuff....
              entry[1]== dr[2];
              entry[2]== dr[3];
                break;
            }
        }
    }

Upvotes: 1

Answers (3)

Tim Schmelter

Reputation: 460158

You can use Linq-To-DataSet, Enumerable.Join is much more efficient since it's using a Dictionary/Lookup under the hood.

var join = from rowA in A.AsEnumerable()
           join rowB in B.AsEnumerable()
           on new { col1 = rowA.Field<string>(2), col2 = rowA.Field<string>(3) }
           equals new { col1 = rowB.Field<string>(1), col2 = rowB.Field<string>(2) }
           select new { rowA, rowB };
foreach (var bothRows in join)
{
    DataRow rowA = bothRows.rowA;
    DataRow rowB = bothRows.rowB;
    // do stuff....
}

Why is LINQ JOIN so much faster than linking with WHERE?

Update, acc. your comment:

I need to find and mark in the B table rows which are not in the A table. So basically on your proposed solution I would like to get join which will have like (entry1 != dr[2] || entry[2] != dr[3]))

Then you could use a combination of Enumerable.Except and Join:

var bKeyCols = B.AsEnumerable()
    .Select(r => new { col1 = r.Field<string>(1), col2 = r.Field<string>(2) });
var aKeyCols = A.AsEnumerable()
    .Select(r => new { col1 = r.Field<string>(2), col2 = r.Field<string>(3) });
var bNotInA = bKeyCols.Except(aKeyCols);
var bRowsNotInA = from row in B.AsEnumerable()
                  join keyCols in bNotInA
                  on new { col1 = row.Field<string>(1), col2 = row.Field<string>(2) } equals keyCols
                  select row;

Enumerable.Except also uses a set which makes it very efficient at the cost of some memory.

Upvotes: 3

Anders Abel

Reputation: 69260

What you are doing has O(n^2) complexity, which is bad performance wise. If it is bad now, it will be even worse when you add more lines to the tables.

The algorithm would be considerably faster if you could sort both the tables before. In that case you could loop through both the tables on the same time.

Another more simple option (but not as fast) is to convert the first of the tables to a dictionary instead, to allow random access to it when looping through the second table.

Edit

It should be possible to do this with LINQ instead, that way you won't have to do the heavy algorithm implementation yourself.

var matching = from br in B.Rows
join ar in A.Rows on ar[1] equals br[2]
where ar[2] == br[3]
select new { ar, br };

foreach (var match in matching)
{
  // Do stuff.
}

Upvotes: 3

Tigran

Reputation: 62256

It's hard to say where is a real problem, but you may try to use DataTable built-in query possibility.

Pseudocode:

DataView dv = new DataView(A);
foreach (DataRow entry in B.Rows)
{
   dv.RowFilter = "COLUMN_A1 = " +  entry[1] + " AND COLUMN_A2=" + entry[2];
   foreach (DataRowView drv in dv) {          
        //DO SOMETHING
   }
}

where COLUMN_A1 and COLUMN_A2 are respecive column names from A table

Qr simply

first get all data you need to process in one collection
process after that collection only once.

Consider that in this case you will have increased memory pressure.

Upvotes: 2

Comparison of two large datatables is slow

Answers (3)

Edit

Related Questions