HerrimanCoder
HerrimanCoder

Reputation: 7218

Find matching records in DataTable as fast as possible

I have C# DataTables with very large numbers of rows, and in my importer app I must query these hundreds of thousands of times in a given import. So I'm trying to find the fastest possible way to search. Thus far I am puzzling over very strange results. First, here are 2 different approaches I have been experimenting with:

APPROACH #1

public static bool DoesRecordExist(string keyColumn, string keyValue, DataTable dt)
{
    if (dt != null && dt.Rows.Count > 0)
       return dt.Select($"{keyColumn} = '{SafeTrim(keyValue)}'").Count() > 0;
    else
       return false;
}

APPROACH #2

public static bool DoesRecordExist(string keyColumn, string keyValue, DataTable dt)
{
    if (dt != null && dt.Rows.Count > 0)
    {
        int counter = dt.AsEnumerable().Where(r => string.Equals(SafeTrim(r[keyColumn]), keyValue, StringComparison.CurrentCultureIgnoreCase)).Count();
        return counter > 0;
    } 
    else
       return false;
}

In a mock test I run each method 15,000 times, handing in hardcoded data. This is apples-to-apples, a fair test. Approach #1 is dramatically faster. But in actual app execution, Approach #1 is dramatically slower.

Why the counterintuitive results? Is there some other faster way to query datatables that I haven't tried?

EDIT: The reason I use datatables as opposed to other types of collections is because all my datasources are either MySQL tables or CSV files. So datatables seemed like a logical choice. Some of these tables contain 10+ columns, so different types of collections seemed an awkward match.

Upvotes: 0

Views: 1549

Answers (3)

da_jokker
da_jokker

Reputation: 1033

All. BINGO! Wanted to share as a different answer just because my previous might be suited for a bit of a different approach. In this scenario, I was able to go from 8 MINUTES, down to 6 SECONDS, not using either approaches...

Again, the key is a HashTable, or in my case a dictionary because I had multiple records. To recap, for me, I needed to delete 1 row from my DataTable for every matching record I found in another DataTable. With the goal that in the end, my First Datatable only contained the "Missing" records.

This uses a different function...

    // -----------------------------------------------------------
    // Creates a Dictionary with Grouping Counts from a DataTable
    public Dictionary<string, Int32> GroupBy(DataTable myDataTable, params string[] pGroupByFieldNames)
    {
        Dictionary<string, Int32> myGroupBy = new Dictionary<string, Int32>(StringComparer.InvariantCultureIgnoreCase);  //Makes the Key Case Insensitive
        foreach (DataRow myRecord in myDataTable.Rows)
        {
            string myKey = "";
            foreach (string strGroupFieldName in pGroupByFieldNames)
            {
                myKey += Convert.ToString(myRecord[strGroupFieldName]).Trim();
            }

            if (myGroupBy.ContainsKey(myKey) == false)
            {
                myGroupBy.Add(myKey, 1);
            }
            else
            {
                myGroupBy[myKey] += 1;
            }
        }

        return myGroupBy;
    }

Now.. say you have a Table of Records that you want to use as the "Match Values" based on Col1 and Col2

Dictionary<string, Int32> myQuickLookUpCount = GroupBy(myMatchTable, "Col1", "Col2");

And now the magic. We are looping through your Primary Table, and removing 1 instance of a record for each instance in the Matching Table. This is the part that took 8 minutes with Approach #2, or 38 minutes with Approach #1.. but now only takes seconds.

myDataTable.AcceptChanges(); //Trick that allows us to delete during a ForEach!
foreach (DataRow myDataRow in myDataTable.Rows)
{
   //Grab the Key Values
   string strKey1Value = Convert.ToString(myDataRow ["Col1"]);
   string strKey2Value = Convert.ToString(myDataRow ["Col2"]);

   if (myQuickLookUpCount.TryGetValue(strKey1Value + strKey2Value, out Int32 intTotalCount) == true && intTotalCount > 0)
   {
     myDataTable.Delete();  //Mark our Row to Delete
     myQuickLookUpCount [strKey1Value + strKey2Value ] -= 1;  //Decrement our Counter
    }
 }

 myDataTable.AcceptChanges(); //Commits our changes and actually deletes the rows.

Upvotes: 0

da_jokker
da_jokker

Reputation: 1033

I have a very similar issue except that I need the actual First Occurrence of a matching row.

Using the .Select.FirstOrDefault (Approach 1) takes 38 minutes to run. Using the .Where.FirstOrDefault (Approach 2) takes 6 minutes to run.

In a similar situation where I didn't need the FirstOrDefault, but just needed to find and work with the uniquely matching record, what I found to be the fastest by far is to use a HashTable where the Key is the Combined Values of any Columns you are trying to match, and the Value is the Data Row itself. Finding a Match is near instant.

The Function is

 public Hashtable ConvertToLookup(DataTable myDataTable, params string[] pKeyFieldNames)
    {
        Hashtable myLookup = new Hashtable(StringComparer.InvariantCultureIgnoreCase);  //Makes the Key Case Insensitive
        foreach (DataRow myRecord in myDataTable.Rows)
        {
            string myHashKey = "";
            foreach (string strKeyFieldName in pKeyFieldNames)
            {
                myHashKey += Convert.ToString(myRecord[strKeyFieldName]).Trim();
            }

            if (myLookup.ContainsKey(myHashKey) == false)
            {
                myLookup.Add(myHashKey, myRecord);
            }
        }

        return myLookup;
    }

The usage is...

//Build the Lookup Table
Hashtable myLookUp = ConvertToLookup(myDataTable, "Col1Name", "Col2Name");

//Use it
if (myLookUp.ContainsKey(mySearchForValue) == true)
{
  DataRow myRecord = (DataRow)myLookUp[mySearchForValue]);
}

Upvotes: 0

Olivier Jacot-Descombes
Olivier Jacot-Descombes

Reputation: 112602

If you want a faster access and still want to stick to the DataTables, use a dictionary to store the row numbers for given keys. Here I assume that each key is unique in the DataTable. If not, you would have to use a Dictionary<string, List<int>> or Dictionary<string, HashSet<int>> to store the indexes.

var indexes = new Dictionary<string, int>();
for (int i = 0; i < dt.Rows.Count; i++) {
    indexes.Add((string)dt.Rows[i].Column(keyColumn), i);
}

Now you can access a row in a super fast way with

var row = dt.Rows[indexes[theKey]];

Upvotes: 2

Related Questions