Reputation: 1367
I am using this simple algorithm to search for some text in a document and tag the page on which I found it:
for (int i = 1; i <= a.PageCount; i++)
{
    Buf.Append(a.Pages[i].Text);
    String contain = Buf.ToString();
    if (contain != "")
    {
        // inside is a dictionary of keys; each value holds the pages on which the key was found
        foreach (KeyValuePair<string, List<string>> pair in inside)
        {
            if (contain.Contains(pair.Key))
                inside[pair.Key].Add((i).ToString());
        }
    }
    Buf.Clear();
}
It works, but when I search a 700-page document for over 500 keys it is very slow; it takes about 1-2 minutes to finish. Is there any way to speed it up? I am using C#.
Thanks!
Upvotes: 0
Views: 505
Reputation: 43036
A few points:
- Buf is unnecessary; just assign a.Pages[i].Text directly to contain.
- inside[pair.Key] wastes time looking up the value associated with that key; the time is wasted because you already have a much cheaper reference to that object in pair.Value.
Sample code:
for (int i = 1; i <= a.PageCount; i++)
{
    String contain = a.Pages[i].Text;
    if (contain != "")
    {
        // inside is a dictionary of keys; each value holds the pages on which the key was found
        foreach (KeyValuePair<string, List<int>> pair in inside)
        {
            if (contain.Contains(pair.Key))
                pair.Value.Add(i);
        }
    }
}
Finally, make sure Pages does in fact use a one-based index; collections are more commonly zero-indexed.
EDIT: since Pages is a dictionary:
foreach (KeyValuePair<int, Page> kvp in a.Pages)
{
    string contain = kvp.Value.Text;
    if (contain == "")
        continue;
    foreach (KeyValuePair<string, List<int>> pair in inside)
        if (contain.Contains(pair.Key))
            pair.Value.Add(kvp.Key);
}
How many times did you time the first code sample? The time could vary depending on many external factors; the fact that a single run of one approach is faster or slower than a single run of another doesn't really tell you much, especially since the suggestions I made probably don't address the bulk of the problem.
As someone else pointed out, the main problem is that you're calling contain.Contains(pair.Key) 350,000 times; that's probably your bottleneck. You can profile the method to find out if that is true. If it is, then something like the Rabin-Karp algorithm, as suggested by Miserable Variable, is probably your best bet.
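For illustration only, here is a rough sketch of the multi-pattern Rabin-Karp idea (the class and method names are made up, not from the question): group the keys by length, hash each group once, then slide a rolling hash over the page text so every window is checked against all keys of that length at once, verifying hash hits with an ordinal comparison.

using System.Collections.Generic;
using System.Linq;

static class MultiPatternSearch
{
    const ulong B = 257; // rolling-hash base

    // Returns every key that occurs somewhere in text.
    public static HashSet<string> FindAny(string text, IEnumerable<string> keys)
    {
        var found = new HashSet<string>();
        foreach (var group in keys.GroupBy(k => k.Length))
        {
            int m = group.Key;
            if (m == 0 || m > text.Length)
                continue;

            // Hash every key of this length once, keeping the keys for verification.
            var keysByHash = new Dictionary<ulong, List<string>>();
            foreach (string key in group)
            {
                ulong h = Hash(key, m);
                List<string> list;
                if (!keysByHash.TryGetValue(h, out list))
                    keysByHash[h] = list = new List<string>();
                list.Add(key);
            }

            // B^(m-1), used to drop the leading character from the rolling hash.
            ulong pow = 1;
            for (int j = 1; j < m; j++)
                pow *= B;

            ulong rolling = Hash(text, m);
            for (int i = 0; ; i++)
            {
                List<string> candidates;
                if (keysByHash.TryGetValue(rolling, out candidates))
                    foreach (string key in candidates)
                        if (string.CompareOrdinal(text, i, key, 0, m) == 0)
                            found.Add(key);

                if (i + m >= text.Length)
                    break;
                rolling = (rolling - text[i] * pow) * B + text[i + m];
            }
        }
        return found;
    }

    static ulong Hash(string s, int length)
    {
        ulong h = 0;
        for (int i = 0; i < length; i++)
            h = h * B + s[i];
        return h;
    }
}

Per page you would then call it once, e.g. foreach (string key in MultiPatternSearch.FindAny(a.Pages[i].Text, inside.Keys)) inside[key].Add(i); (using the List<int> values shown above). Whether this beats a trie or a regex depends on your key lengths, so measure.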
Upvotes: 4
Reputation: 28752
[[
EDIT: The following is irrelevant, since you are clearing Buf at the end of the loop (though note that you don't really need Buf; string pageText = a.Pages[i].Text is all you need).
What is Buf? You have
Buf.Append(a.Pages[i].Text);
Does that not force the Contains to look through increasingly large strings? I am surprised you are not running out of memory with 700 pages.
]]
There are more efficient ways to see if any of a set of strings appears in another string. For example, you could prepare a tree structure of the keys so you don't have to compare multiple times (see the sketch below).
Do consider existing third-party libraries; there must be some.
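As a rough illustration of the "tree of keys" idea (my own sketch with made-up names, not code from the question): build one trie over all the keys, then walk it from every position of the page text, so each character of the page is matched against all keys at once instead of once per key.

using System.Collections.Generic;

class KeyTrie
{
    class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public string Key; // non-null when a whole key ends at this node
    }

    readonly Node _root = new Node();

    public void Add(string key)
    {
        Node node = _root;
        foreach (char c in key)
        {
            Node next;
            if (!node.Children.TryGetValue(c, out next))
            {
                next = new Node();
                node.Children[c] = next;
            }
            node = next;
        }
        node.Key = key;
    }

    // Returns every key that occurs somewhere in text.
    public HashSet<string> FindAll(string text)
    {
        var found = new HashSet<string>();
        for (int start = 0; start < text.Length; start++)
        {
            Node node = _root;
            for (int i = start; i < text.Length; i++)
            {
                if (!node.Children.TryGetValue(text[i], out node))
                    break;
                if (node.Key != null)
                    found.Add(node.Key);
            }
        }
        return found;
    }
}

Per page that would be something like foreach (string key in trie.FindAll(a.Pages[i].Text)) inside[key].Add(i.ToString()); it still restarts at every position in the worst case, and an Aho-Corasick automaton (what a library would typically implement) removes that restart as well.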
Upvotes: 3
Reputation: 19765
Standard performance/debugging procedure: comment out pieces of your code and measure, then add them back one at a time until it 'gets bad'. That's likely your problem area.
For example, you might start by commenting out the entire foreach.
It looks like there are some possibly complex/expensive objects in use (inside, Buf, etc.). Comment out the usage of those and put them back one at a time, measuring as you go, as sketched below.
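A rough way to measure each step (reusing the names from the question; Stopwatch lives in System.Diagnostics):

var sw = System.Diagnostics.Stopwatch.StartNew();
for (int i = 1; i <= a.PageCount; i++)
{
    string contain = a.Pages[i].Text; // step 1: time just the page access

    // step 2: put the foreach over inside back in and compare the timings
    //foreach (KeyValuePair<string, List<string>> pair in inside)
    //    if (contain.Contains(pair.Key))
    //        inside[pair.Key].Add(i.ToString());
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);

Run each variant a few times; a single measurement can be skewed by caching or other processes.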
Upvotes: 0
Reputation: 217263
I don't have 700 pages to test with, but you could try using a regex:
var s = Stopwatch.StartNew();
var r = new Regex(string.Join("|", from x in inside select Regex.Escape(x.Key)));
for (int i = 1; i <= a.PageCount; i++)
{
    foreach (Match match in r.Matches(a.Pages[i].Text))
    {
        inside[match.Value].Add(i.ToString());
    }
}
Console.WriteLine(s.Elapsed);
Upvotes: 0