Martin Ch
Martin Ch

Reputation: 1367

Speed up string search algorithm

I am using this simple algorithm for searching some text in document and taging on which page I found it

for (int i = 1; i <= a.PageCount; i++)
{
    Buf.Append(a.Pages[i].Text);
    String contain = Buf.ToString();
    if (contain != "")
    {
        // Inside is dictionary of keys and value contain page where I found it
        foreach (KeyValuePair<string, List<string>> pair in inside)
        {
              if (contain.Contains(pair.Key))
                  inside[pair.Key].Add((i).ToString());
        }
    }

    Buf.Clear();
 }

I have no problem with it, but when I search in 700 pages document and I am looking for over 500 keys, its very slow, took about 1-2 minutes to pass, is there any way how to speed it up? I am using c#

Thanks!

Upvotes: 0

Views: 505

Answers (4)

phoog
phoog

Reputation: 43036

A few points:

  • Get rid of Buf; just assign a.Pages[i].Text directly to contain:
  • inside[pair.Key] wastes time looking up the value associated with that key; the time is wasted because you have a much cheaper reference to that object in pair.Value.
  • if you have a list of integer values, why are you storing them as strings?

Sample code:

for (int i = 1; i <= a.PageCount; i++)
{
    String contain = a.Pages[i].Text
    if (contain != "")
    {
        // Inside is dictionary of keys and value contain page where I found it
        foreach (KeyValuePair<string, List<int>> pair in inside)
        {
            if (contain.Contains(pair.Key))
                pair.Value.Add(i);
        }
    }
}

Finally, make sure Pages does in fact use a one-based index. Collections are more commonly zero-indexed.

EDIT since Pages is a dictionary:

foreach (KeyValuePair<int, Page> kvp in a.Pages)
{
    string contain = kvp.Value.Text;
    if (contain == "")
        continue;
    foreach (KeyValuePair<string, List<int>> pair in inside)
        if (contain.Contains(pair.Key))
            pair.Value.Add(kvp.Key);
}

How many times did you time the first code sample? The time could vary depending on many external factors; the fact that a single run of one approach is faster or slower than a single run of another doesn't really tell you much, especially since the suggestions I made probably don't address the bulk of the problem.

As someone else pointed out, the main problem is that you're calling contain.Contains(pair.Key) 350,000 times; that's probably your bottleneck. You can profile the method to find out if that is true. If it is true, then something like the Rabin Karp algorithm as suggested by Miserable Variable is probably your best bet.

Upvotes: 4

Miserable Variable
Miserable Variable

Reputation: 28752

[[

EDIT: The following is irrelevant, since you are clearing Buf at the end of loop (though note that you don't really need buf, string pageText = a.Pages[i].Text is all you need)

What is Buf? You have

Buf.Append(a.Pages[i].Text);

Does that not force the Contains to look through increasingly large size strings? I am surprised you are not running out of memory with 700 pages.

]]

There are more efficient ways to see if any of a set of strings appears in another string. For example, you could prepare a tree structure of the keys so you don't have to compare multiple times.

See Rabin-Karp Algorithm

Do consider existing third party libraries, there must be some.

Upvotes: 3

n8wrl
n8wrl

Reputation: 19765

Standard performance/debugging procedures - comment out pieces of your code and measure. Add them back one at a time until it 'gets bad.' That's likely your problem area.

for example, you might start by commenting out the entire foreach.

It looks like there's some possibly-complex/expensive objects in use - inside, Buf, etc. Comment out the usage of those and put them back one at a time.

Upvotes: 0

dtb
dtb

Reputation: 217263

I don't have 700 pages to test with, but you could try using a regex:

var s = Stopwatch.StartNew();
var r = new Regex(string.Join("|", from x in inside select Regex.Escape(x.Key)));

for (int i = 1; i <= a.PageCount; i++)
{
    foreach (Match match in r.Matches(a.Pages[i].Text))
    {
        inside[match.Value].Add(i.ToString());
    }
}

Console.WriteLine(s.Elapsed);

Upvotes: 0

Related Questions