Reputation: 21
I'm looping over a DataTable with 100 to 10,000 rows, comparing each row to every other row in a double loop.
for (int i = 0; i < DT1.Rows.Count; i++)
{
    for (int j = 0; j < DT1.Rows.Count; j++)
    {
        // some code to compare data
    }
}
For 100-200 rows it finishes in a few minutes, which is OK, but with a few thousand rows it runs for hours without finishing.
What can I do to speed it up? The best idea I've come up with is to use lists of objects instead of DataTables.
Any other suggestions?
Can threads be used to do this?
Thanks.
Upvotes: 2
Views: 6997
Reputation: 51
I recently came across a similar scenario that I had to work through, though in my case I was comparing a pair of Excel files. For my trial run, after getting it working, I had 530 rows on one side and 459,000 on the other inside the nested loop, which is roughly 243 million iterations. My program worked through it in roughly 30 seconds. I used a foreach in this scenario:
foreach (DataRow r1 in DT1.Rows) // loop the first source data
{
    foreach (DataRow r2 in DT2.Rows) // loop the second source data
    {
        // comparison code here...
    }
}
Edit: As a point of reference, your loop tracks three values on every iteration: the first two are your counters, and the third, DT1.Rows.Count, is the major performance hit. Because the row count appears directly in the loop conditions, it has to be re-evaluated on every iteration, which adds unneeded time to the program. If you absolutely need the counters, assign the row count to a variable first:
int DT1Count = DT1.Rows.Count;

for (int i = 0; i < DT1Count; i++)
{
    for (int j = 0; j < DT1Count; j++)
    {
        // some code to compare data
    }
}
This way, the row count is evaluated only once, which removes the extra processing needed to re-evaluate it on each iteration.
Upvotes: 5
Reputation: 174457
The biggest optimization to be made here is the following:
Currently, you are comparing every pair of rows twice, once as (i, j) and once as (j, i), and you are also comparing each row with itself, because both loops start at index 0.
The simplest fix to this would be to change the inner loop to this:
for (int j = i + 1; j < DT1.Rows.Count; j++)
This will dramatically reduce the number of comparisons. Your algorithm currently needs n^2 comparisons; the proposed fix reduces this to less than half, namely (n^2 - n) / 2. For 5,000 rows that is about 12.5 million comparisons instead of 25 million.
Upvotes: 0
Reputation: 529
You could also count on .NET internals to do a better job than manual looping by using:
DataTable.Select(filterExpression, sortExpression)
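For example (a rough sketch only, assuming hypothetical FirstName and LastName columns and a second table DT2 to match against), you could let Select do the inner lookup instead of writing the inner loop yourself:

foreach (DataRow r1 in DT1.Rows)
{
    // Escape single quotes so the filter expression stays valid.
    string first = r1["FirstName"].ToString().Replace("'", "''");
    string last = r1["LastName"].ToString().Replace("'", "''");

    // Let the DataTable engine find the matching rows.
    DataRow[] matches = DT2.Select(
        string.Format("FirstName = '{0}' AND LastName = '{1}'", first, last));

    foreach (DataRow r2 in matches)
    {
        // handle the matching pair (r1, r2) here
    }
}

Note that calling Select inside a loop still repeats work for every outer row, so this mainly helps when the filter narrows the candidate rows down quickly.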
Upvotes: 0
Reputation: 727047
Although you can certainly optimize your search by using hash tables, the best optimization is to let the database engine do the search for you. RDBMS engines are optimized for this kind of task; no client-side optimization should be able to beat them. Your biggest disadvantage is having to pull the data from the database into your program, which is very slow. The database engine has all the data right there, which is a huge advantage.
For example, if you are looking for rows representing users with identical first and last name, a simple query with a self-join will get you results in seconds, not minutes, because the data never leaves the engine.
select u1.userId, u2.userId
from User u1
join User u2
  on u1.FirstName = u2.FirstName
 and u1.LastName = u2.LastName
 and u1.userId < u2.userId -- avoid matching a row with itself and reporting each pair twice
Assuming that the FirstName and LastName columns are indexed, this query will find you the duplicates very quickly.
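For completeness, the client-side hash-table approach mentioned at the start could look roughly like this (a sketch only, using the same assumed FirstName/LastName columns): group the rows by a composite key so each row is touched once instead of n times.

var groups = new Dictionary<string, List<DataRow>>();

foreach (DataRow row in DT1.Rows)
{
    // Composite key; "\u001F" is just an unlikely separator character.
    string key = row["FirstName"].ToString() + "\u001F" + row["LastName"].ToString();

    List<DataRow> bucket;
    if (!groups.TryGetValue(key, out bucket))
    {
        bucket = new List<DataRow>();
        groups[key] = bucket;
    }
    bucket.Add(row);
}

// Any bucket with more than one row holds duplicate names.
foreach (var pair in groups)
{
    if (pair.Value.Count > 1)
    {
        // handle the duplicate rows in pair.Value
    }
}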
Upvotes: 1
Reputation: 5556
for (int i = 0; i < DT1.Rows.Count; i++)
{
    for (int j = i + 1; j < DT1.Rows.Count; j++) // <-- starts from the next row
    {
        // some code to compare data
    }
}
Upvotes: 0
Reputation: 1663
If the results are sorted in some order, you can put them into an array and look up matches with a binary search instead of a full inner loop.
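A rough sketch of that idea (the "Key" column name is just a placeholder for whatever value the rows are compared on): copy the keys from one table into a sorted array, then replace the inner loop with Array.BinarySearch.

// Build a sorted array of the comparison keys from DT1.
string[] keys = new string[DT1.Rows.Count];
for (int i = 0; i < keys.Length; i++)
{
    keys[i] = DT1.Rows[i]["Key"].ToString();
}
Array.Sort(keys);

// For each row in DT2, a binary search replaces the inner loop.
foreach (DataRow row in DT2.Rows)
{
    int index = Array.BinarySearch(keys, row["Key"].ToString());
    if (index >= 0)
    {
        // a matching key exists in DT1; handle the match here
    }
}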
Upvotes: 0