Dot NET

Reputation: 4897

Querying database for every record faster than using LINQ

I have around 700K rows being iterated. For each row, a SELECT SQL statement is run against the database to check whether the 'name' field of the current record exists in the respective table.

Hitting the database 700K times strikes me as very inefficient, so I opted to read all the data once before the loop, store it in a DataTable, and check on each iteration whether the respective record is contained in the DataTable via LINQ.

On doing this, performance deteriorated considerably: the process now takes around double the time to complete (confirmed multiple times through benchmarking).

This is the original (faster) code:

for (int index = 0; index < dtSightings.Rows.Count; index++)
{
    // dictionary holds the @name and @monthYear parameter values for the current row
    // (the code that populates it is not shown)
    DataTable dtResults = Utilities.ExecuteQueryMysqlString(connectionString, "SELECT name FROM my_table WHERE name = @name AND month_year = @monthYear", dictionary);

    if (dtResults == null || dtResults.Rows.Count == 0)
    {
        //Continue
    }
}

public static DataTable ExecuteQueryMysqlString(string connectionString, string sql, Dictionary<string, object> listParameters)
{
    DataTable dtResults = new DataTable();

    if (string.IsNullOrWhiteSpace(connectionString) == false)
    {
        connectionString += ";Allow User Variables=True;";

        try
        {
            using (MySqlConnection connection = new MySqlConnection(connectionString))
            {
                connection.Open();

                using (MySqlCommand cmd = connection.CreateCommand())
                {
                    cmd.CommandTimeout = 0;
                    cmd.CommandText = sql;

                    if (listParameters != null && listParameters.Count > 0)
                    {
                        foreach (string currentKey in listParameters.Keys)
                        {
                            cmd.Parameters.Add(new MySqlParameter(currentKey, GetDictionaryValue(listParameters, currentKey)));
                        }
                    }

                    using (MySqlDataAdapter da = new MySqlDataAdapter(cmd))
                    {
                        da.Fill(dtResults);
                    }
                }
            }

            return dtResults;
        }
        catch (Exception ex)
        {
            MessageBox.Show("ERROR: " + ex.Message, "ERROR", MessageBoxButtons.OK, MessageBoxIcon.Error);
            return dtResults;
        }
    }
    else
    {
        return dtResults;
    }
}

This is the "optimised" (but slower) code:

DataTable dt = Utilities.ExecuteQueryMysqlString(connectionString, "SELECT name, month_year FROM my_table", null);

for (int index = 0; index < dtSightings.Rows.Count; index++)
{
    // name, billYear and billMonth are taken from the current dtSightings row
    DataRow row = dt.AsEnumerable()
                    .Where(r => r.Field<string>("name").Equals(name, StringComparison.InvariantCultureIgnoreCase)
                             && r.Field<DateTime>("month_year") == new DateTime(billYear, billMonth, 1))
                    .FirstOrDefault();

    if (row == null)
    {
        //Continue
    }
}

I cannot see why the first approach is so much faster. Is there perhaps a more optimised approach in place of the second one?

Upvotes: 0

Views: 1247

Answers (2)

Ivan Stoev

Reputation: 205589

The LINQ approach is slow because Where is basically a linear search, and when performed inside a loop, it can really slow down the process.
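For intuition, the Where call in the question is roughly equivalent to the following manual scan (a minimal sketch; name, billYear and billMonth stand in for the per-row values from the question), so the full table is walked on every iteration of the outer loop:

// Roughly what dt.AsEnumerable().Where(...).FirstOrDefault() does:
// walk every row of dt until the predicate matches, or fall through with null.
DataRow match = null;
foreach (DataRow r in dt.Rows)
{
    if (r.Field<string>("name").Equals(name, StringComparison.InvariantCultureIgnoreCase)
        && r.Field<DateTime>("month_year") == new DateTime(billYear, billMonth, 1))
    {
        match = r;
        break;
    }
}
// Running this inside the 700K-iteration loop costs ~700,000 * N row comparisons.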

What you really need is a fast hash-based lookup data structure. I would suggest using a HashSet<T> with a custom struct like this (mainly to support case-insensitive name lookup):

public struct NameDatePair : IEquatable<NameDatePair>
{
    public readonly string Name;
    public readonly DateTime Date;

    public NameDatePair(string name, DateTime date) { Name = name; Date = date; }

    // Case-insensitive comparer used for the Name part of the key
    static IEqualityComparer<string> NameComparer { get { return StringComparer.InvariantCultureIgnoreCase; } }

    public override int GetHashCode() { return NameComparer.GetHashCode(Name) ^ Date.GetHashCode(); }
    public override bool Equals(object obj) { return obj is NameDatePair && Equals((NameDatePair)obj); }
    public bool Equals(NameDatePair other) { return NameComparer.Equals(Name, other.Name) && Date == other.Date; }
}

And here is how you can use it in your case (it should be much faster than both your approaches):

var dt = Utilities.ExecuteQueryMysqlString(connectionString, "SELECT name, month_year FROM my_table", null);
var nameDatePairSet = new HashSet<NameDatePair>(dt.AsEnumerable().Select(
    r => new NameDatePair(r.Field<string>("name"), r.Field<DateTime>("month_year"))));

for (int index = 0; index < dtSightings.Rows.Count; index++)
{
    var dr = dtSightings.Rows[index];
    var name = dr.Field<string>("name");
    var billYear = dr.Field<int>("billYear");
    var billMonth = dr.Field<int>("billMonth");
    bool exists = nameDatePairSet.Contains(new NameDatePair(name, new DateTime(billYear, billMonth, 1)));
}

(Since you didn't show where the variables name, billYear and billMonth come from, the above code contains some guesses; you can adjust it to your needs.)
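As a side note (my sketch, not part of the original answer): on C# 7 or later you could key the HashSet by a value tuple instead of a custom struct. Tuple equality uses the default string comparer, so the name is normalized with ToUpperInvariant() on both sides to keep the lookup case-insensitive:

// Assumption: upper-casing the name emulates the case-insensitive comparer above.
var lookup = new HashSet<(string Name, DateTime Date)>(
    dt.AsEnumerable().Select(r =>
        (r.Field<string>("name").ToUpperInvariant(), r.Field<DateTime>("month_year"))));

bool exists = lookup.Contains((name.ToUpperInvariant(), new DateTime(billYear, billMonth, 1)));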

Upvotes: 1

O. Jones

Reputation: 108651

Both your code samples have this basic structure.

1. For each row in one bunch of rows ...
  2. read the rows in another bunch of rows ...
    3. to identify a particular situation.

That's the very definition of an O(n²) algorithm: its performance is bad and gets dramatically worse as you get more data. With 700K rows in dtSightings and m rows in my_table, that's on the order of 700,000 × m comparisons.

Your first example is faster because you use a SELECT to read the rows in step 2, and your DBMS probably optimizes it (for instance, an index on (name, month_year) would turn step 2 into a point lookup). In your second example, step 2 iterates over all the rows, for every row in step 1.

The trick is to rig things up so you pass through the table of step 2 just once.

It's hard to tell exactly what you're doing with dtSightings: you increment the index variable but don't seem to use it anywhere. At any rate, this algorithm outline should get you from O(n²) to O(n).

1. Make a HashSet ready to hold the rows of dtSightings for matching.
2. For each row in dtSightings ...
   a. populate that row into the HashSet.
3. For each row in your query ...
   a. look it up in your HashSet;
   b. if you get a miss (a non-match, I believe), report it, as in the sketch below.
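Here is a minimal C# sketch of that outline (my interpretation of the steps; the dtSightings column names name, billYear and billMonth are assumptions carried over from the other answer, and names are upper-cased to keep the match case-insensitive; flip the two passes if you want to report missing sightings instead):

// Steps 1 and 2: one pass over dtSightings to build the lookup set.
var sightings = new HashSet<(string Name, DateTime Date)>();
foreach (DataRow dr in dtSightings.Rows)
{
    sightings.Add((dr.Field<string>("name").ToUpperInvariant(),
                   new DateTime(dr.Field<int>("billYear"), dr.Field<int>("billMonth"), 1)));
}

// Step 3: one pass over the query results, one O(1)-average lookup per row.
DataTable dt = Utilities.ExecuteQueryMysqlString(connectionString,
    "SELECT name, month_year FROM my_table", null);
foreach (DataRow dr in dt.Rows)
{
    var key = (dr.Field<string>("name").ToUpperInvariant(), dr.Field<DateTime>("month_year"));
    if (!sightings.Contains(key))
    {
        // non-match: report it
    }
}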

Steps 2 and 3 each take O(n) time: they're proportional to the number of rows you're processing. The lookup in step 3 is a hash lookup, which takes O(1) on average each time you run it. That's where the overall O(n) comes from.

Any programmer who deals with megarows of data needs to wrap her head around computational complexity -- the O(n) stuff -- to succeed.

Upvotes: 1
