Zack

Reputation: 2497

How to grab all text inside of matching brackets with Ruby and/or regular expressions

I am doing some code cleanup and need to make sure that my gsub! only runs on a small section of the text. The portion I need to examine starts with {{Infobox television (\{\{[Ii]nfobox\s[Tt]elevision, to be technical) and ends with the matching double braces "}}".

An example of the gsub! that will be run is text.gsub!(/\|(\s*)channel\s*=\s*(.*)\n/, "|\\1network = \\2\n")

...
{{Infobox television
 | show_name            = 60 Minutos
 | image                = 
 | director             = 
 | developer            = 
 | channel              = [[NBC]]
 | presenter            = [[Raúl Matas]] (1977–86)<br />[[Raquel Argandoña]] (1979–81)
 | language             = [[Spanish language|Spanish]]
 | first_aired          = {{Date|7 April 1975}}
 | website              = {{url|https://foo.bar.com}}
}}
...

Note:

Upvotes: 2

Views: 103

Answers (2)

Wiktor Stribiżew

Reputation: 627517

You may use a regex with a bit of recursion:

/(?=\{\{[Ii]nfobox\s[Tt]elevision)(\{\{(?>[^{}]++|\g<1>)*}})/

Or, if there are single { or } inside, you will need to also match those with (?<!{){(?!{)|(?<!})}(?!}):

/(?=\{\{[Ii]nfobox\s[Tt]elevision)(\{\{(?>[^{}]++|(?<!{){(?!{)|(?<!})}(?!})|\g<1>)*}})/

See the Rubular demo

Details:

  • (?=\{\{[Ii]nfobox\s[Tt]elevision) - a positive lookahead that makes sure the current location is followed by a {{Infobox television-like string (in either casing)
  • (\{\{(?>[^{}]++|\g<1>)*}}) - Group 1, which matches the following:
    • \{\{ - a {{ substring
    • (?>[^{}]++|\g<1>)* - zero or more occurrences of any of these alternatives (separated by |, meaning "or"):
      • [^{}]++ - 1 or more chars other than { and }
      • (?<!{){(?!{) - a { that is not adjacent to another { (second pattern only)
      • (?<!})}(?!}) - a } that is not adjacent to another } (second pattern only)
      • \g<1> - the whole Group 1 subpattern, matched recursively
    • }} - a }} substring
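
To tie this back to the question, here is a minimal sketch of how the first pattern could be used to confine the gsub to the infobox. The helper name rename_channel_in_infobox and the sample text are made up for illustration, and the sketch assumes there is at most one such infobox per page (use gsub instead of sub on the outer call if there may be several):

```ruby
# Pattern from the answer: a lookahead for the infobox opener, then a
# recursive group that matches balanced {{ ... }} pairs.
INFOBOX_RE = /(?=\{\{[Ii]nfobox\s[Tt]elevision)(\{\{(?>[^{}]++|\g<1>)*\}\})/

# Hypothetical helper: renames "channel" to "network" inside the infobox only.
def rename_channel_in_infobox(text)
  # The block receives the matched infobox; its return value replaces it.
  text.sub(INFOBOX_RE) do |infobox|
    infobox.gsub(/\|(\s*)channel\s*=\s*(.*)\n/, "|\\1network = \\2\n")
  end
end

# Made-up sample with "channel" lines both inside and outside the infobox.
sample = <<~WIKI
  | channel = outside the infobox
  {{Infobox television
   | show_name = 60 Minutos
   | channel   = [[NBC]]
   | website   = {{url|https://foo.bar.com}}
  }}
  | channel = also outside
WIKI

result = rename_channel_in_infobox(sample)
puts result
```

Only the channel line inside the infobox should be rewritten; the two lines outside it are never seen by the inner gsub.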

Upvotes: 1

Ghoti

Reputation: 2380

Can't give you a direct answer without spending a lot of time on it.

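But it is notable that the first bracket set is at the beginning of a line, as is the last one.

So you have

/^\{\{(.*)^\}\}$/m

In Ruby, the m flag makes . match newlines as well, so (.*) can span several lines; ^ and $ match at the start and end of every line regardless. That will match everything between the braces. The () parentheses form a capture group, which means that you can pull out what was matched inside the braces, for example: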

string = <<_EOT
{{Infobox television
 | show_name            = 60 Minutos
 | image                =
 | director             =
 | developer            =
 | channel              = [[NBC]]
 | presenter            = [[Raúl Matas]] (1977–86)<br />[[Raquel Argandoña]] (1979–81)
 | language             = [[Spanish language|Spanish]]
 | first_aired          = {{Date|7 April 1975}}
 | website              = {{url|https://foo.bar.com}}
}}


_EOT

matcher = string.match(/^\{\{(.*)^\}\}$/m)

matcher[0] will give you the whole expression

matcher[1] will give you what was matched inside the () brackets

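The danger with this is that .* is "greedy": it matches the largest piece of text it can, so if the text contains more than one block that ends with }} on its own line, the match runs through to the last one. You will have to turn this off by using the lazy form .*? instead. Without more info on what you're trying to do I can't help any more.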
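
As a rough illustration of the difference, the sketch below runs the greedy and the lazy form against a made-up text containing two templates. It assumes that nested templates never close with }} on a line of their own, since the lazy form stops at the first such line:

```ruby
# Made-up text with two templates, each closed by }} on its own line.
text = <<~WIKI
  {{Infobox television
   | channel = [[NBC]]
  }}
  Some prose between templates.
  {{Other template
   | key = value
  }}
WIKI

greedy = text.match(/^\{\{(.*)^\}\}$/m)   # .* grabs as much as possible
lazy   = text.match(/^\{\{(.*?)^\}\}$/m)  # .*? grabs as little as possible

puts greedy[0].include?("Other template")  # the greedy match spans both templates
puts lazy[0].include?("Other template")    # the lazy match ends at the first }} line
puts lazy[1]                               # text captured inside the first template
```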

NB - to match () brackets you have to escape them. See https://ruby-doc.org/core-2.1.1/Regexp.html for more info.

Upvotes: 0
