Reputation: 43
I'm building a proof of concept data analysis app, using C# & Entity Framework. Part of this app is calculating TF*IDF scores, which means getting a count of documents that contain every word.
I have a SQL query (to a remote database with about 2,000 rows) wrapped in a foreach loop:
idf = db.globalsets.Count(t => t.text.Contains("myword"));
Depending on my dataset, this loop would run 50-1,000+ times for a single report. On a sample set where it only has to run about 50 times, it takes nearly a minute, so about 1 second per query. So I'll need much faster performance to continue.
Is 1 second per query slow for an MSSQL contains query on a remote machine?
What paths could be used to dramatically improve that? Should I look at upgrading the web host the database is on? Running the queries async? Running the queries ahead of time and storing the result in a table (I'm assuming a WHERE = query would be much faster than a CONTAINS query?)
Upvotes: 1
Views: 2942
Reputation: 332
As others have recommended, I think you should implement that query on the db side. Take a look at this article about SQL Server Full Text Search, that should be the way to solve your problem.
Upvotes: 1
Reputation: 1729
You can do much better than full text search in this case, by making use of your local machine to store the idf scores, and writing back to the database once the calculation is complete. There aren't enough words in all the languages of the world for you to run out of RAM:
Calculating a tf-idf for a given term in a given document can now be done just by:
Loading the document.
This should be thousands of times faster than your original, and hundreds of times faster than full-text-search.
Upvotes: 1
Reputation: 13146
Applying a contains
query in a loop extremely bad idea. It kills the performance and database. You should change your approach and I strongly suggest you to create Full Text Search
indexes and perform query over it. You can retrieve the matched record texts with your query strings.
select t.Id, t.SampleColumn from containstable(Student,SampleColumn,'word or sampleword') C
inner join table1 t ON C.[KEY] = t.Id
Perform just one query, put the desired words which are searched by using operators (or, and etc.) and retrieve the matched texts. Then you can calculate TF-IDF scores in memory.
Also, still retrieving the texts from SQL Server into in memory might takes long to stream but it is the best option instead of apply N contains
query in the loop.
Upvotes: 0