Konstantin

Reputation: 3123

How to backreference in Ruby regular expression (regex) with gsub when I use grouping?

I would like to patch some text data extracted from web pages. Sample:

t="First sentence. Second sentence.Third sentence."

There is no space after the period at the end of the second sentence. This tells me that the third sentence was on a separate line (after a br tag) in the original document.

I want to use a regexp to insert a "\n" character in the proper places and patch my text. My regex:

t2=t.gsub(/([.\!?])([A-Z1-9])/,$1+"\n"+$2)

But unfortunately it doesn't work: "NoMethodError: undefined method `+' for nil:NilClass". How can I properly backreference the matched groups? It was so easy in Microsoft Word: I just had to use the \1 and \2 symbols.

Upvotes: 31

Views: 20116

Answers (3)

Joshua Cheek

Reputation: 31726

You can backreference in the substitution string with \1 (to match capture group 1). However, the literal backslash has to be escaped as \\ when using a double-quoted string literal:

t = "First sentence. Second sentence.Third sentence!Fourth sentence?Fifth sentence."
t.gsub(/([.!?])([A-Z1-9])/, "\\1\n\\2")
#=> "First sentence. Second sentence.\nThird sentence!\nFourth sentence?\nFifth sentence."
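
To make the quoting rule concrete, here is a small sketch comparing the three ways the replacement can be written. The variable names (pattern, single, double, broken) are only for illustration and are not part of the question:

```ruby
t = "First sentence. Second sentence.Third sentence."
pattern = /([.!?])([A-Z1-9])/

# Single-quoted pieces keep the backslash, so gsub receives \1 and \2.
# The "\n" stays double-quoted so that it is a real newline.
single = t.gsub(pattern, '\1' + "\n" + '\2')

# Double-quoted string: each backslash has to be doubled.
double = t.gsub(pattern, "\\1\n\\2")

# Unescaped "\1" inside double quotes is an octal escape (the character
# \x01), so gsub never sees a backreference here.
broken = t.gsub(pattern, "\1\n\2")

puts single == double   # both insert the newline between the sentences
p broken                # shows the control characters that replaced the match
```

The first two calls should give the same patched text; the third silently drops the matched characters, which is the usual symptom of a missing escape.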

Upvotes: 36

Ben Wiseley

Reputation: 547

If you got here because of Rubocop complaining "Avoid the use of Perl-style backrefs." about $1, $2, etc., you can do this instead:

some_id = $1
# or
some_id = Regexp.last_match[1] if Regexp.last_match

some_id = $5
# or
some_id = Regexp.last_match[5] if Regexp.last_match
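
As a rough sketch of the same idea, the MatchData can also be kept in a local variable, or the group index passed straight to Regexp.last_match. The sample string and the names text, m, and same_id are made up for illustration:

```ruby
text = "order id: 42"  # hypothetical sample input

# Keep the MatchData in a local variable instead of reading $1.
m = /id: (\d+)/.match(text)
some_id = m[1] if m

# Regexp.last_match also accepts the group index as an argument.
same_id = Regexp.last_match(1)

puts some_id
puts same_id
```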

It'll also want you to do

%r{//}.match(some_string)

instead of

some_string[//]

Lame (Rubocop)

Upvotes: 9

sawa

Reputation: 168071

  • If you are using gsub(regex, replacement), then use '\1', '\2', ... to refer to the match. Make sure not to put double quotes around the replacement, or else escape the backslash as in Joshua's answer. The conversion from '\1' to the match will be done within gsub, not by literal interpretation.
  • If you are using gsub(regex){replacement}, then use $1, $2, ...
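
The block form could look like the sketch below. It also suggests why the code in the question fails: Ruby evaluates method arguments before calling gsub, so $1 is still nil at that point, whereas a block runs once per match, after the groups are set. The names t2 and t3 are only illustrative:

```ruby
t = "First sentence. Second sentence.Third sentence."

# Block form: the block runs once per match, after $1 and $2 are set.
t2 = t.gsub(/([.!?])([A-Z1-9])/) { "#{$1}\n#{$2}" }

# The same idea without the Perl-style globals.
t3 = t.gsub(/([.!?])([A-Z1-9])/) do
  m = Regexp.last_match
  "#{m[1]}\n#{m[2]}"
end

puts t2
```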

But for your case, it is easier not to use matches:

t2 = t.gsub(/(?<=[.\!?])(?=[A-Z1-9])/, "\n")
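
A quick way to check that the lookaround version behaves like the capture-group version is to run both on the same input. This is only a sketch; with_groups and with_lookaround are illustrative names:

```ruby
t = "First sentence. Second sentence.Third sentence!Fourth sentence?Fifth sentence."

# Capture groups: the matched characters are consumed and written back.
with_groups = t.gsub(/([.!?])([A-Z1-9])/, "\\1\n\\2")

# Lookbehind + lookahead: the match is zero-width, so nothing is consumed
# and the newline is simply inserted at that position.
with_lookaround = t.gsub(/(?<=[.!?])(?=[A-Z1-9])/, "\n")

puts with_groups == with_lookaround
```

Because the lookaround match has no width, there is nothing to put back, which is why no backreference is needed in the replacement.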

Upvotes: 27
