Reputation: 5019
I have the following
public class SearchResult
{
    public string Description { get; set; }
    public int Year { get; set; }
    public int Type { get; set; }
}
I create a List of these, cache it, and then try to search through this collection (1.2M records) with the following:
var found = myList.Where(x => x.Description.StartsWith(query)).Take(10).ToList();
This is pretty slow. What I am going for: is there a better way to store a list of objects and have the ability to search a string property of the object?
Should I be sorting the collection before caching it? I would like to be able to do .StartsWith and .Contains on the Description property with the fastest path to the top 10 matches.
If I just go to the db it's faster (I have put an index on the text field). I am hoping to improve performance by getting the results once, sticking them in memory, and then running all searches against the in-memory cache instead of going to the db each time. But this is proving to be slower than the db calls using the SQL LIKE '{query}%' statement.
Upvotes: 2
Views: 2943
Reputation: 51214
Fast string prefix searches are best done using a trie data structure. What's cool about the trie is that all descendants of any given node have a common prefix of the string associated with that node. It can also be compacted into a radix tree, which is perhaps slightly more complex to implement.
Right now you are using LINQ-to-Objects, which iterates through the entire list for every search (and each StartsWith call is O(m), with m being the length of the query string). If you used LINQ to SQL, the query would get translated to an SQL query which would use indexes to perform a more efficient lookup.
This link has an example implementation of the auto-complete functionality using a trie.
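For illustration, here is a minimal sketch of such a trie (the class and member names are my own, not taken from the linked article). Each node stores at most the first 10 results whose Description passes through it, so a prefix lookup returns up to 10 matches in time proportional to the query length:

using System.Collections.Generic;

// Sketch only: a prefix trie keyed on the characters of Description.
public class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public List<SearchResult> Results = new List<SearchResult>();
}

public class PrefixTrie
{
    private readonly TrieNode _root = new TrieNode();

    public void Add(SearchResult item)
    {
        var node = _root;
        foreach (char c in item.Description.ToLowerInvariant())
        {
            TrieNode next;
            if (!node.Children.TryGetValue(c, out next))
            {
                next = new TrieNode();
                node.Children[c] = next;
            }
            node = next;
            // Keeping only the first 10 results per node bounds memory and
            // makes a "top 10" lookup depend only on the query length.
            if (node.Results.Count < 10)
                node.Results.Add(item);
        }
    }

    public List<SearchResult> StartsWith(string query)
    {
        var node = _root;
        foreach (char c in query.ToLowerInvariant())
        {
            if (!node.Children.TryGetValue(c, out node))
                return new List<SearchResult>();   // no item has this prefix
        }
        return node.Results;
    }
}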
(Update)
As @David mentioned in the comments, a trie might be overkill if you already have this data loaded in a list (that is, if you need to keep it in that form anyway, which you probably do). In that case, a better alternative for StartsWith queries would be to keep the list sorted. That would allow you to use binary search to get your results in O(m log n).
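As a rough sketch of that approach (assuming the cached list is sorted once by Description; the helper name is made up for illustration), a lower-bound binary search finds the first candidate and a forward scan collects matches while the prefix still holds:

using System;
using System.Collections.Generic;

// Sketch: lower-bound binary search over a list sorted by Description.
List<SearchResult> PrefixSearch(List<SearchResult> sorted, string query, int limit = 10)
{
    int lo = 0, hi = sorted.Count;
    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (string.Compare(sorted[mid].Description, query, StringComparison.OrdinalIgnoreCase) < 0)
            lo = mid + 1;      // everything up to mid sorts before the query
        else
            hi = mid;          // mid could be the first match
    }

    var results = new List<SearchResult>();
    for (int i = lo; i < sorted.Count && results.Count < limit; i++)
    {
        if (!sorted[i].Description.StartsWith(query, StringComparison.OrdinalIgnoreCase))
            break;             // left the block of items sharing the prefix
        results.Add(sorted[i]);
    }
    return results;
}

// Usage: sort once when building the cache, then search many times.
// myList.Sort((a, b) => string.Compare(a.Description, b.Description, StringComparison.OrdinalIgnoreCase));
// var found = PrefixSearch(myList, query);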
Depending on how often the data changes, you might also use a balanced binary tree for storing the data, to allow fast insertion/deletion (which is essentially what SortedDictionary gives you).
But ultimately, if you also need Contains queries, then you will either need to spend much more memory on indexing (as described by @Igor in his answer), or simply let your DBMS do it (as suggested by @David).
On the other hand, you might want to try using SQL Server's full text search. Or, if you are willing to go a bit outside of writing SQL, Lucene.Net with in-memory caching should give you even better performance.
Upvotes: 1
Reputation: 2837
First of all, this is an information retrieval task, and it seems like there isn't an efficient way to do it using only LINQ.
What you need is some kind of reverse index. A very simple implementation of a reverse index in your case is a Dictionary<string, List<SearchResult>>. For each word in the movie titles, this Dictionary contains all movies whose title contains that word. To build this simple reverse index you can use the following code:
// Build the reverse index: map each word to the results whose Description contains it.
var reverseIndex = new Dictionary<string, List<SearchResult>>();
for (int i = 0; i < searchResults.Count; i++)
{
    var res = searchResults[i];
    string[] words = res.Description.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    foreach (var word in words)
    {
        if (!reverseIndex.ContainsKey(word))
            reverseIndex[word] = new List<SearchResult> { res };
        else if (!reverseIndex[word].Contains(res))
            reverseIndex[word].Add(res);
    }
}
Now, instead of slow:
searchResults.Where(x => x.Description.Contains(query));
you can use simple:
reverseIndex[query];
It works very fast. Instead of
searchResults.Where(x => x.Description.StartsWith(query));
you can use:
reverseIndex[query].Where(s => s.Description.StartsWith(query));
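One caveat not shown above: indexing the dictionary directly throws a KeyNotFoundException when the query word is not in the index, so a guarded lookup along these lines (a small sketch using the same reverseIndex, and the LINQ already used in the question) is safer:

List<SearchResult> matches;
if (!reverseIndex.TryGetValue(query, out matches))
    matches = new List<SearchResult>();   // unknown word: empty result instead of an exception

var top10 = matches
    .Where(s => s.Description.StartsWith(query))
    .Take(10)
    .ToList();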
If your query contains more than one word, you can split it into words, extract the List<SearchResult> for each word, and then intersect the lists.
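A rough sketch of that intersection (the method name and limit are my own, assuming the reverseIndex built above):

using System;
using System.Collections.Generic;
using System.Linq;

// Sketch: look up each word of the query and intersect the result lists.
List<SearchResult> MultiWordSearch(
    Dictionary<string, List<SearchResult>> reverseIndex, string query, int limit = 10)
{
    var words = query.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

    List<SearchResult> candidates = null;
    foreach (var word in words)
    {
        List<SearchResult> bucket;
        if (!reverseIndex.TryGetValue(word, out bucket))
            return new List<SearchResult>();           // one word matches nothing => no results

        candidates = candidates == null
            ? new List<SearchResult>(bucket)            // first word: start with its list
            : candidates.Intersect(bucket).ToList();    // later words: keep only common results
    }
    return (candidates ?? new List<SearchResult>()).Take(limit).ToList();
}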
With this simple implementation of a reverse index your query can contain only whole words. If you want to search by a part of a word, you need a permuterm index. You can find one possible C# implementation here. Note that a permuterm index needs a lot of additional memory.
Upvotes: 1
Reputation: 24525
String comparisons are inherently slow, and additionally you have to iterate the entire list on every search to find the matches. This is never going to perform well, and in fact will more than likely get worse over time as new records are added to the source.
Here is a nice article on string searching for those concerned with speed.
I would suggest doing just as you mentioned: move the searching to the database and limit the number of returned rows. While this still involves I/O, databases are optimized for handling this sort of thing. Another advantage is that you avoid the pitfall of your app crashing and losing the cached searches; likewise you can leverage async/await, which will make your app more responsive.
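For example, here is a hedged sketch of that route (the table and column names are assumptions, adjust them to your schema), letting SQL Server do the prefix match and return only the top 10 rows asynchronously:

using System.Collections.Generic;
using System.Data.SqlClient;
using System.Threading.Tasks;

// Sketch: push the LIKE 'query%' search and the row limit down to the database.
public async Task<List<SearchResult>> SearchAsync(string connectionString, string query)
{
    var results = new List<SearchResult>();
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(
        "SELECT TOP 10 Description, [Year], [Type] FROM SearchResults " +
        "WHERE Description LIKE @q + '%'", conn))
    {
        cmd.Parameters.AddWithValue("@q", query);
        await conn.OpenAsync();
        using (var reader = await cmd.ExecuteReaderAsync())
        {
            while (await reader.ReadAsync())
            {
                results.Add(new SearchResult
                {
                    Description = reader.GetString(0),
                    Year = reader.GetInt32(1),
                    Type = reader.GetInt32(2)
                });
            }
        }
    }
    return results;
}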
If you decide that you still want to go the route of pulling everything into memory and then querying the objects, good luck. The only other suggestion I have would be to consider search caching, such that if someone searches for the same thing relatively recently - you could cache those results and immediately return them.
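A tiny sketch of that caching idea (the field and method names are illustrative), keyed by the query string:

using System;
using System.Collections.Generic;

// Sketch: remember the results of recent queries so repeated searches are free.
// Naive on purpose: no eviction or expiry is shown here.
private static readonly Dictionary<string, List<SearchResult>> _recentSearches =
    new Dictionary<string, List<SearchResult>>(StringComparer.OrdinalIgnoreCase);

public List<SearchResult> SearchWithCache(string query, Func<string, List<SearchResult>> search)
{
    List<SearchResult> cached;
    if (_recentSearches.TryGetValue(query, out cached))
        return cached;                     // someone searched this recently

    var results = search(query);           // fall back to the real search (db or in-memory)
    _recentSearches[query] = results;
    return results;
}

Something like MemoryCache could replace the plain dictionary if you need expiration.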
From the same author, here is another source for reading, where he compares collection string lookup speed:
http://cc.davelozinski.com/c-sharp/fastest-collection-for-string-lookups
Upvotes: 1