steven_noble

Reputation: 4213

How can I speed up this regex?

I am using Rails 3.2.11 on Ruby 1.9.2p320.

I have created a summarized() method to show where certain keywords appear in a series of documents of 1,000 to 15,000 words, stored as text strings. Each document can contain each keyword anywhere from 0 to 100 times.

I have:

ActiveRecord::Schema.define(:version => 20130404000559) do
  create_table "references", :force => true do |t|
    t.text     "source_text"
  end
end

When I call @reference.source_text.summarized(keywords), the following method is phenomenally slow, even when there is just one keyword in keywords:

class String
  def summarized(keywords)
    safe_text = Array.new
    keywords.each do |keyword|
      # strip markup, collapse whitespace, then grab up to 250 characters of
      # context on either side of every occurrence of the keyword
      safe_text << self.strip_tags.gsub(/\n|\r/, " ").gsub("  ", " ").scan(/\w*.{0,250}#{keyword}.{0,250}\w*/im)
    end
    return safe_text.flatten.uniq
  end
end

How can I speed it up?

Update: I'm now looking at the possibility that strip_tags may be at least one of the culprits.
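
If that turns out to be it, pulling the sanitizing out of the loop so it only runs once per document would look roughly like this (untested sketch, same behavior otherwise):

class String
  def summarized(keywords)
    # strip_tags and the whitespace clean-up run once per document
    # instead of once per keyword
    clean = strip_tags.gsub(/\n|\r/, " ").gsub("  ", " ")
    keywords.flat_map { |keyword|
      clean.scan(/\w*.{0,250}#{keyword}.{0,250}\w*/im)
    }.uniq
  end
end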

Upvotes: 3

Views: 252

Answers (2)

Loamhoof

Reputation: 8293

Let's begin with why it is slow. Your regex starts like this:

/\w*.{0,250}

Now let's look at the worst case (and quite a frequent one in your case). Say you have around 1,000 characters without your keyword. Your regex will first match a word, then check whether your keyword is there 250 characters later. Nope. It backtracks to the 249th character. Still nope. 250 tries later, it starts doing the same with the \w* part. You're lucky, words are pretty short. And it will do the same for the next (at least) 749 characters. So, in the worst case, you go through your string something like 250 times. Laziness would fix that; however, I assume it's not the behavior you want (this .{0,250} is there to deal with the beginning of the string, isn't it?).

Now, a cool regex tool is lookarounds. You can look ahead or behind in your string to make additional checks. While the lookaround itself isn't part of the matched string, you can use capturing groups inside it.
The drawback? You can't have an arbitrary-size lookbehind (damn, just what you needed). So it wouldn't quite work on its own, because you couldn't make sure you have a whole word at the beginning, and you also wouldn't be able to match a keyword in the first 250 characters of the string.

So, here's a strange solution:
- Add 250 characters to the beginning of your string (whitespace, say) so the lookbehind can always succeed, even for keywords near the beginning.
- Use this regex (you can take into account what m_x said and use it only once for all the keywords):

/#{keyword}(?<=(.{250}))(.{0,250}\w*)/im

Then massage the result a bit (recombine the string by concatenating the captured groups, remove any characters that came from the 250 padding ones, maybe drop the first word to make sure it's whole...) and it should be quite a bit faster.
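
Untested sketch of the whole thing for a single keyword (the method name, padding character and post-processing are only illustrative):

class String
  def summarized_fast(keyword)
    padded = (" " * 250) + self    # pad so the lookbehind always has 250 characters behind it
    regex  = /#{Regexp.escape(keyword)}(?<=(.{250}))(.{0,250}\w*)/im

    padded.scan(regex).map do |before, after|
      # group 1 already ends with the keyword, so concatenating the two
      # captures rebuilds the snippet; strip drops any leftover padding
      # (you could also drop the first word here to make sure it's whole)
      (before + after).strip
    end.uniq
  end
end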

Edit (comment about the regex):

You might be wondering why the (?<=(.{250})) is placed AFTER the keyword. It's just so the capturing happens only once a keyword has actually been matched. It also means the keyword itself ends up inside the capturing group, so the group only holds 250 - keyword.length characters of context before the keyword (not a big deal; you can build your regex dynamically based on the keyword's length). You could also place the lookbehind after the first character of the keyword, but that would be a pain... I'll stop my monologue here.
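
For what it's worth, sizing the lookbehind dynamically would look something like this (purely illustrative):

keyword = "example"                      # placeholder
window  = 250 + keyword.length           # widen the lookbehind by the keyword's length
regex   = /#{Regexp.escape(keyword)}(?<=(.{#{window}}))(.{0,250}\w*)/im
# group 1 now holds a full 250 characters of context plus the keyword itself;
# the padding added in step 1 above would have to grow to the same length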

Upvotes: 6

m_x

Reputation: 12564

How about this? You would only scan the string once.

/\w*.{0,250}(?:#{keywords.join('|')}).{0,250}\w*/im
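
For example, with the question's clean-up chain and a couple of made-up keywords (Regexp.escape is only there in case a keyword contains regex metacharacters):

text     = @reference.source_text     # relies on the question's strip_tags extension on String
keywords = ["alpha", "beta"]          # placeholder keywords
pattern  = /\w*.{0,250}(?:#{keywords.map { |k| Regexp.escape(k) }.join('|')}).{0,250}\w*/im
snippets = text.strip_tags.gsub(/\n|\r/, " ").gsub("  ", " ").scan(pattern).uniq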

As a side note, this kind of app would likely benefit from an indexed search engine like sunspot.

Upvotes: 2
