Reputation: 7024

C#: Efficiently search a large string for occurences of other strings

I'm using C# to continuously search for multiple string "keywords" within large strings, which are >= 4kb. This code is constantly looping, and sleeps aren't cutting down CPU usage enough while maintaining a reasonable speed. The bog-down is the keyword matching method.

I've found a few possibilities, and all of them give similar efficiency.

1) http://tomasp.net/articles/ahocorasick.aspx -I do not have enough keywords for this to be the most efficient algorithm.

2) Regex. Using an instance level, compiled regex. -Provides more functionality than I require, and not quite enough efficiency.

3) String.IndexOf. -I would need to do a "smart" version of this for it provide enough efficiency. Looping through each keyword and calling IndexOf doesn't cut it.

Does anyone know of any algorithms or methods that I can use to attain my goal?

Upvotes: 5

Answers (5)

gary

Reputation: 624

I just posted this on a similar thread but it is probably more relevant here.

I'm doing a similar search, basically looking for keywords of around 10-50 bytes in length within text of approximately 45k bytes. I search about 1900 keywords over nine million texts so getting this as fast as possible is also a similar priority.

So, the fastest method I've found using .NET 4 is parallel Regex IsMatch.

Here's an example of getting total matches -

needles.AsParallel ( ).Sum ( l => Regex.IsMatch ( haystack , Regex.Escape ( l ) ) ? 1 : 0 );

This works for my scenario (above), it is 55% faster than ordinal indexOf parallel comparisons in my tests at least for the sort of data size I am using. I also imagine the speed improvement only occurs if you are using multi-core machines.

Would be interested if anyone can find a faster method?

Upvotes: 0

csharptest.net

Reputation: 64248

Actually I had to solve this before, it was kinda fun. I had 20k html pages, each with a title, and wanted all other occurrence of the title on other pages to link to the page with that title. Sounds very similar to what your trying to accomplish.

The approach:

Process the text of a file by turning it into a linked list of { Word, Whitespace } where the Word was identified as a contiguous alpha-numeric sequence with a few special characters, and the whitespace was everything that led up to the next word.
Repeated the process in step 1 for each 'title' of the pages I wanted to link to.
Each word from the node in the linked list in step 1 was then added to a binary-sorted list.
Now you just need to walk the first word from each title linked-list from step 2 and seek into the binary sorted list from step 3. You may find several hits or even soft-hits when a word is plural so you might have several starting nodes from the binary list you need to test.
Once you process the document into the form described in step 1 it's actually very easy to modify by inserting new nodes and/or modifying the Whitespace value. Once complete you simply walk the entire list and dump it all to a stream.

It sounds more complicated than it is, it took about two days to get it working well.

However you solve it, have fun with it :)

Upvotes: 2

Guffa

Reputation: 700910

I developed an efficient use of IndexOf for this question:

A better way to replace many strings - obfuscation in C#

It uses a list of keywords and their next position in the string. That way you only need to call IndexOf once for each keyword and then once for each match you find. It's especially efficient when replacing keywords in a large string, as you can process the string from start to end instead of processing the entire string once for each keyword. I don't know why you are looking for the keywords in the strings and what you do with the strings, but perhaps it may be useful in your situation.

Upvotes: 2

Erich Kitzmueller

Reputation: 36987

Are you always looking for the same keywords? Try Boyer-Moore. It requires some pre-processing for the keywords, but gains speed afterwards.

Upvotes: 3

Jon Skeet

Reputation: 1504092

I haven't tried it, but have you looked at Rabin-Karp? Apparently it has a bad worst-case complexity, but is usually quite good.

What do your keywords look like? In particular, are they always delimited by spaces (or something similar)? If so, you could basically look through the string once looking for "words" and then either create a map from a word to the list of indexes of that word, or perhaps only do so for keywords you're interested in.

If you could give more details of the exact situation (such as the keywords, delimiters and what you need the result of your search to be) that would help.

Upvotes: 3

C#: Efficiently search a large string for occurences of other strings

Answers (5)

Related Questions