Reputation: 7024
I'm using C# to continuously search for multiple string "keywords" within large strings, which are >= 4kb. This code is constantly looping, and sleeps aren't cutting down CPU usage enough while maintaining a reasonable speed. The bog-down is the keyword matching method.
I've found a few possibilities, and all of them give similar efficiency.
1) http://tomasp.net/articles/ahocorasick.aspx -I do not have enough keywords for this to be the most efficient algorithm.
2) Regex. Using an instance level, compiled regex. -Provides more functionality than I require, and not quite enough efficiency.
3) String.IndexOf. -I would need to do a "smart" version of this for it provide enough efficiency. Looping through each keyword and calling IndexOf doesn't cut it.
Does anyone know of any algorithms or methods that I can use to attain my goal?
Upvotes: 5
Views: 7593
Reputation: 624
I just posted this on a similar thread but it is probably more relevant here.
I'm doing a similar search, basically looking for keywords of around 10-50 bytes in length within text of approximately 45k bytes. I search about 1900 keywords over nine million texts so getting this as fast as possible is also a similar priority.
So, the fastest method I've found using .NET 4 is parallel Regex IsMatch.
Here's an example of getting total matches -
needles.AsParallel ( ).Sum ( l => Regex.IsMatch ( haystack , Regex.Escape ( l ) ) ? 1 : 0 );
This works for my scenario (above), it is 55% faster than ordinal indexOf parallel comparisons in my tests at least for the sort of data size I am using. I also imagine the speed improvement only occurs if you are using multi-core machines.
Would be interested if anyone can find a faster method?
Upvotes: 0
Reputation: 64248
Actually I had to solve this before, it was kinda fun. I had 20k html pages, each with a title, and wanted all other occurrence of the title on other pages to link to the page with that title. Sounds very similar to what your trying to accomplish.
The approach:
It sounds more complicated than it is, it took about two days to get it working well.
However you solve it, have fun with it :)
Upvotes: 2
Reputation: 700910
I developed an efficient use of IndexOf for this question:
A better way to replace many strings - obfuscation in C#
It uses a list of keywords and their next position in the string. That way you only need to call IndexOf once for each keyword and then once for each match you find. It's especially efficient when replacing keywords in a large string, as you can process the string from start to end instead of processing the entire string once for each keyword. I don't know why you are looking for the keywords in the strings and what you do with the strings, but perhaps it may be useful in your situation.
Upvotes: 2
Reputation: 36987
Are you always looking for the same keywords? Try Boyer-Moore. It requires some pre-processing for the keywords, but gains speed afterwards.
Upvotes: 3
Reputation: 1504092
I haven't tried it, but have you looked at Rabin-Karp? Apparently it has a bad worst-case complexity, but is usually quite good.
What do your keywords look like? In particular, are they always delimited by spaces (or something similar)? If so, you could basically look through the string once looking for "words" and then either create a map from a word to the list of indexes of that word, or perhaps only do so for keywords you're interested in.
If you could give more details of the exact situation (such as the keywords, delimiters and what you need the result of your search to be) that would help.
Upvotes: 3