Vas

Reputation: 343

Regex to capture words before and after a target in ruby

Assuming we have a text:

In software, a stack overflow occurs if the call stack pointer exceeds the stack bound. The call stack may consist of a limited amount of address space, often determined at the start of the program. The size of the call stack depends on many factors, including the programming language, machine architecture, multi-threading, and amount of available memory.

I want to find the 2 words before and after a specific word (the target). For example, if the target is the word start, it should match 'at' 'the' (left) and 'of' 'the' (right). I am using the following Ruby method, but it returns no matches. Any tips on what to fix in my regex? I have also tried "#{target}" instead of Regexp.escape.

    def checkWords(target, text, numLeft = 2, numRight = 2)

        regex = ""
        regex += " (\\S+) " * numLeft
        regex += Regexp.escape(target)
        regex += " (\\S+)" * numRight

        pattern = Regexp.new(regex, Regexp::IGNORECASE)
        matches = pattern.match(text)

        return true if matches
    end

Edit:

Regex printed:

(\S+)  (\S+) "£52" (\S+) (\S+)

Edit based on Wiktor Stribiżew:

    def checkWords(target, text, numLeft = 2, numRight = 2)

        pattern = Regexp.new(/#{"(\\S+) "*numLeft}#{Regexp.escape(target)}#{" (\\S+)"*numRight}/i)
        matches = pattern.match(text)

    end

Upvotes: 3

Views: 1455

Answers (3)

Wiktor Stribiżew

Reputation: 627537

You have spaces doubled around the first (\\S+):

regex += " (\\S+) " * numLeft
          ^

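When the string is repeated, this part becomes " (\\S+)  (\\S+) ": the trailing space of one copy meets the leading space of the next, so there are 2 spaces between the (\\S+) groups, while the text has only one space between words.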

So, in your case, just use

def checkWords(target, text, numLeft = 2, numRight = 2)
    text[/#{"(\\S+) "*numLeft}#{Regexp.escape(target)}#{" (\\S+)"*numRight}/i]
end
puts checkWords('start', 'In software, a stack overflow occurs if the call stack pointer exceeds the stack bound. The call stack may consist of a limited amount of address space, often determined at the start of the program. The size of the call stack depends on many factors, including the programming language, machine architecture, multi-threading, and amount of available memory.')

See Ruby demo

It might be a good idea to add + after the spaces next to (\S+). And if you do not need the captures, remove the parentheses from around \S+.
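
A minimal sketch of that suggestion, assuming runs of whitespace between words should be tolerated and the neighbouring words are wanted as captures. The helper name `surrounding_words` is hypothetical and does not come from the question:

```ruby
# Hypothetical helper: returns the captured neighbours of target, or nil.
def surrounding_words(target, text, num_left = 2, num_right = 2)
  # \s+ accepts one or more whitespace characters between words
  left  = '(\S+)\s+' * num_left
  right = '\s+(\S+)' * num_right
  pattern = /#{left}#{Regexp.escape(target)}#{right}/i
  match = pattern.match(text)
  match && match.captures
end

# Note the double space between "at" and "the"
p surrounding_words('start', 'often determined at  the start of the program.')
# => ["at", "the", "of", "the"]
```

Because the whitespace is matched by `\s+` rather than a literal space, the doubled space in the sample text no longer prevents a match.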

Upvotes: 1

andyg0808

Reputation: 1403

In the case you're looking at, I think you might be better served by splitting the text on non-word characters and then searching through the splits for your target word. Once you've found it, it's very easy to take the appropriate slices of the array of words in order to get the results you want.

For example:

def check_words(target, text, num_left = 2, num_right = 2)
  # Split the text using the regex /\W+/ (matches non-word characters)
  words = text.split(/\W+/)
  # Iterate over the words in the array
  # Enumerable#each_with_index includes the index, so retrieving the surrounding
  # words is a snap
  words.each_with_index do |word, index|
    if word == target
      # Clamp the start so a target near the beginning does not wrap around
      first = [index - num_left, 0].max
      # Make a hash with two Symbol keys and small
      # arrays containing the desired words
      return {
        before: words.slice(first, index - first),
        after: words.slice(index + 1, num_right)
      }
    end
  end
  # No match: return nil instead of the words array
  nil
end

This can then be called like so:

check_words('start', text)

And it returns a Hash containing the num_left words before and the num_right words after the keyword:

{:before=>["at", "the"], :after=>["of", "the"]}

The {before: ...} syntax is the Ruby 1.9+ shorthand for {:before => ...}; either syntax will work fine.

Also, you may be interested in the Ruby documentation for Regexp, if you've not seen it already.
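
If the target can occur more than once, the same split-and-slice idea extends to collecting every occurrence. This is only a sketch under that assumption: `all_contexts` is a hypothetical name, and the target itself is left out of both slices.

```ruby
# Hypothetical helper: one {before:, after:} hash per occurrence of target.
def all_contexts(target, text, num_left = 2, num_right = 2)
  words = text.split(/\W+/)
  # Indices of every word equal to the target
  hits = words.each_index.select { |i| words[i] == target }
  hits.map do |i|
    # Clamp so an occurrence near the start does not wrap around
    first = [i - num_left, 0].max
    { before: words[first...i], after: words[i + 1, num_right] }
  end
end

p all_contexts('the', 'at the start of the program')
# => [{:before=>["at"], :after=>["start", "of"]},
#     {:before=>["start", "of"], :after=>["program"]}]
# (exact inspect formatting depends on the Ruby version)
```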

Upvotes: 2

Aleksei Matiushkin

Reputation: 121010

▶ input[/(\S+\s+){,2}start(\s+\S+){,2}/i]
#⇒ "at the start of the"

more generic:

▶ target = 'start'
▶ input[/(\S+\s+){,2}#{Regexp.escape target}(\s+\S+){,2}/i]
#⇒ "at the start of the"

To handle punctuation after the target:

▶ target = 'start'
▶ input[/(\S+\s+){,2}#{Regexp.escape target}\p{P}?(\s+\S+){,2}/i]
#⇒ "at the start of the"

Your function might look like:

def checkWords(target, text, numLeft = 2, numRight = 2)
  text =~ /(\S+\s+){,#{numLeft}}#{Regexp.escape target}\p{P}?(\s+\S+){,#{numRight}}/i
end
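
A repeated group such as `(\S+\s+){,2}` keeps only its last repetition as a capture, so the patterns above give the whole matched phrase rather than the individual words. If the left and right words are needed separately, one option is to wrap each repeated run in a named group and split it afterwards. A sketch, where `context_words` and the group names `left` and `right` are hypothetical:

```ruby
# Hypothetical helper: returns the words on each side of target, or nil.
def context_words(target, text, num_left = 2, num_right = 2)
  pattern = /(?<left>(?:\S+\s+){,#{num_left}})#{Regexp.escape(target)}\p{P}?(?<right>(?:\s+\S+){,#{num_right}})/i
  m = pattern.match(text)
  return nil unless m
  # Each named group holds the whole run of words; String#split breaks it on whitespace
  { before: m[:left].split, after: m[:right].split }
end

p context_words('start', 'often determined at the start of the program.')
```

Since `\S+` matches any run of non-whitespace, punctuation attached to a neighbouring word (for example a trailing comma) stays part of that word.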

Upvotes: 4
