MrVocabulary

Reputation: 655

Replace occurrences within capture groups using Regex

tl;dr: How do I replace only specific characters (i.e. line breaks) in a regex match in Ruby?

I have an array of strings. Each element consists of 2 to 4 words (a word being any sequence of characters) separated by spaces, in a fixed order.

I also have a large string in which I want to find instances of those word sequences that are broken by \n instead of a space. For example, I want to match an element of the array:

arr[0] = "aaa bbbb ccccc"

to a string that looks like this:

zzzzzzzzz aaa\n
bbbb ccccc yyyyyyyyy

And make it look like this:

zzzzzzzzz aaa bbbb ccccc yyyyyyyyy

The thing is, I can think of at least two ways of doing it, but they seem very cumbersome. What I would do is:

  1. replace each space in the array with [ \n]
  2. generate a regex with Regexp.union comprising all elements of the array
  3. use the regex to match instances of my arr elements in the string
  4. generate a .gsub! for each string so that it does not replace the entire match, but only elements of the match (or use multiple capture groups)

I suspect, however, that this is a rather silly way to do it. Is there a less roundabout way to do it in Ruby?


EDIT: How do I implement the answer below with Regexp.union? I have a function that generates the regex:

def generateMergeRx(arr_with_keywords)
    arr_with_keywords.delete_if{|x| (x.include? " ") == false}
    matchRegexMerge = Regexp.new("(%{keywordReplace})" % {
        keywordReplace: Regexp.union(arr_with_keywords).source
    })
end

This is what it looks like using puts regexMerge.to_s:

(?-mix:(And\.\ z\ Kobyl\.|Ban\.\ W\.|B\.\ B\.|B\.\ G\.|Biel\.\ J\.)

It corresponds to these array elements:

And. z Kobyl.
Ban. W.
B. B.
B. G.
Biel. J.
(...)

And then I call it like this:

regexMerge = generateMergeRx arr_with_keywords
some_string.gsub!(regexMerge.to_s.gsub!(/ /, "\s"), "\\1")

But what should I put instead of \1? At the moment the output is identical to the input.
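A likely reason the call above leaves the string unchanged can be sketched as follows. It rests on two Ruby behaviours: "\s" in a double-quoted string is an escape for a plain space, and gsub given a String pattern (which is what regexMerge.to_s produces) matches that text literally instead of treating it as a regex. The variable names and sample text here are made up for illustration.

```ruby
# "\s" in double quotes is just a space; single quotes keep backslash + s.
double_quoted = "\s"
single_quoted = '\s'
puts double_quoted == " "      # true
puts single_quoted.length      # 2

text = "zzz aaa\nbbbb yyy"

# Regexp#to_s returns a String such as "(?-mix:(aaa\\ bbbb))".
pattern_as_string = /(aaa\ bbbb)/.to_s

# A String pattern is searched for literally, so nothing in text matches
# and the result equals the input.
literal_result = text.gsub(pattern_as_string, "\\1")
puts literal_result == text    # true
```

So the space replacement is a no-op, and the outer substitution never sees a real Regexp.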

Upvotes: 0

Views: 977

Answers (1)

Aleksei Matiushkin

Reputation: 121000

▶ str = 'zzzzzzzzz aaa
▷ bbbb ccccc yyyyyyyyy'
▶ re = "aaa bbbb ccccc"
▶ str.gsub /#{re.gsub(/ +/, '\s+')}/, re
#⇒ "zzzzzzzzz aaa bbbb ccccc yyyyyyyyy"

The general idea is to build a pattern in which the spaces between words match any whitespace, including \n, and to replace each match with the original string.
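For the array case raised in the question's EDIT, the same idea might be extended roughly as below. This is a minimal sketch, not a definitive implementation: build_merge_regex is a hypothetical helper name, the sample keywords and text are illustrative, and it assumes the words inside each keyword are separated by single spaces. Each word is escaped on its own and the words are joined with \s+, so the separator is never escaped away. The block form of gsub then collapses only the whitespace inside each match.

```ruby
# Hypothetical helper: builds one regex that matches any multi-word keyword
# with its words separated by arbitrary whitespace (spaces or line breaks).
def build_merge_regex(keywords)
  multi_word = keywords.select { |k| k.include?(" ") } # does not mutate the input
  alternatives = multi_word.map { |k|
    # Escape each word separately, then join the words with \s+.
    k.split(" ").map { |w| Regexp.escape(w) }.join('\s+')
  }
  Regexp.new(alternatives.join("|"))
end

keywords = ["aaa bbbb ccccc", "And. z Kobyl.", "B. B.", "single"]
merge_rx = build_merge_regex(keywords)

text = "zzzzzzzzz aaa\nbbbb ccccc yyyyyyyyy, And. z\nKobyl. and B.\nB."

# The block receives each matched phrase; only its internal whitespace
# is replaced with a single space, the words themselves stay untouched.
merged = text.gsub(merge_rx) { |match| match.gsub(/\s+/, " ") }
puts merged
# zzzzzzzzz aaa bbbb ccccc yyyyyyyyy, And. z Kobyl. and B. B.
```

Using the block means no back-reference such as \1 is needed, because the replacement is computed from whatever phrase actually matched.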

Upvotes: 2
