Adam Thompson
Adam Thompson

Reputation: 43

How can I deal with slow performance on Contains query in Entity Framework / MS-SQL?

I'm building a proof of concept data analysis app, using C# & Entity Framework. Part of this app is calculating TF*IDF scores, which means getting a count of documents that contain every word.

I have a SQL query (to a remote database with about 2,000 rows) wrapped in a foreach loop:

idf = db.globalsets.Count(t => t.text.Contains("myword"));

Depending on my dataset, this loop would run 50-1,000+ times for a single report. On a sample set where it only has to run about 50 times, it takes nearly a minute, so about 1 second per query. So I'll need much faster performance to continue.

Is 1 second per query slow for an MSSQL contains query on a remote machine?

What paths could be used to dramatically improve that? Should I look at upgrading the web host the database is on? Running the queries async? Running the queries ahead of time and storing the result in a table (I'm assuming a WHERE = query would be much faster than a CONTAINS query?)

Upvotes: 1

Views: 2942

Answers (3)

Arnel
Arnel

Reputation: 332

As others have recommended, I think you should implement that query on the db side. Take a look at this article about SQL Server Full Text Search, that should be the way to solve your problem.

Upvotes: 1

Adam Brown
Adam Brown

Reputation: 1729

You can do much better than full text search in this case, by making use of your local machine to store the idf scores, and writing back to the database once the calculation is complete. There aren't enough words in all the languages of the world for you to run out of RAM:

  1. Create a dictionary Dictionary<string,int> documentFrequency
  2. Load each document in the database in turn, and split into words, then apply stemming. Then, for each distinct stem in the document, add 1 to the value in the documentFrequency dictionary.
  3. Once all documents are processed this way, write the document frequencies back to the database.
  4. Calculating a tf-idf for a given term in a given document can now be done just by:

  5. Loading the document.

  6. Counting the number of instances of the term.
  7. Loading the correct idf score from the idf table in the database.
  8. Doing the tf-idf calculation.

This should be thousands of times faster than your original, and hundreds of times faster than full-text-search.

Upvotes: 1

Emre Kabaoglu
Emre Kabaoglu

Reputation: 13146

Applying a contains query in a loop extremely bad idea. It kills the performance and database. You should change your approach and I strongly suggest you to create Full Text Search indexes and perform query over it. You can retrieve the matched record texts with your query strings.

select t.Id, t.SampleColumn from containstable(Student,SampleColumn,'word or sampleword') C 
inner join table1 t ON C.[KEY] = t.Id

Perform just one query, put the desired words which are searched by using operators (or, and etc.) and retrieve the matched texts. Then you can calculate TF-IDF scores in memory.

Also, still retrieving the texts from SQL Server into in memory might takes long to stream but it is the best option instead of apply N contains query in the loop.

Upvotes: 0

Related Questions