Dane O'Connor

Reputation: 77278

Remove capture groups from a regex

I have a regex 'simple' that I'd like to use as a building block for another regex 'complex'. The trouble is, the capture groups in 'simple' are interfering with 'complex'. These low-level capture groups are details I don't care about. I'd love to remove them before the regex is consumed.

The question is: how?

Put another way, in code, this isn't working well:

simple = /(a)bc/
complex = /(#{simple}) - (#{simple})/
complex.match("abc - abc").captures # => ["abc", "a", "abc", "a"]
# when I need ["abc", "abc"]

I'd much rather write:

simple = /(a)bc/
complex = /(#{simple.without_capture}) - (#{simple.without_capture})/
complex.match("abc - abc").captures # => ["abc", "abc"]

I'm stuck on how to do this, but I'm betting it's been done before. The implementation of Regexp#without_capture would of course need to account for non-capturing groups, lookahead/lookbehind, etc., so simply removing all the () isn't enough. Also, finding the matching ) for each capture group seems a little challenging.

Thoughts?

EDIT: I forgot to mention that I don't want to manually create two versions of simple (one capturing and one non-capturing). In my actual case it would be impractical to maintain both versions. It'd be much better to be able to toggle the capturing dynamically.

Upvotes: 5

Views: 16368

Answers (5)

svoop

Reputation: 3454

I know this is an old question, but I've written a refinement for a parser project that has many expressions with capture groups for scanning, which need identical non-capturing counterparts for splitting.

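# refine must be called inside a module; the module name here is only illustrative
module RegexpDecapture
  refine Regexp do
    def decapture
      Regexp.new(to_s.gsub(/\(\?<\w+>|(?<![^\\]\\)\((?!\?)/, '(?:'))
    end
  end
end
# activate with: using RegexpDecapture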

It works for capture groups as well as named capture groups, honors expression options, special groups and literal backslash/parenthesis couples. Here are the tests (Ruby 2.5):

describe :decapture do
  it "should replace capture groups with non-capture groups" do
    /(foo) baz (bar)/.decapture.must_equal /(?-mix:(?:foo) baz (?:bar))/
    /(foo) baz (bar)/i.decapture.must_equal /(?i-mx:(?:foo) baz (?:bar))/
  end

  it "should replace named capture groups with non-capture groups" do
    /(?<a>foo) baz (?<b>bar)/.decapture.must_equal /(?-mix:(?:foo) baz (?:bar))/
    /(?<a>foo) baz (?<b>bar)/i.decapture.must_equal /(?i-mx:(?:foo) baz (?:bar))/
  end

  it "should not replace special groups" do
    /(?:foo) (?<=baz) bar/.decapture.must_equal /(?-mix:(?:foo) (?<=baz) bar)/
  end

  it "should not replace literal round brackets" do
    /\(foo\)/.decapture.must_equal /(?-mix:\(foo\))/
  end

  it "should replace literal backslash followed by literal round brackets" do
    /\\(foo\\)/.decapture.must_equal /(?-mix:\\(?:foo\\))/
  end
end
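Applied to the example from the question, the same substitution might look like the sketch below. It is written as a plain method rather than a refinement so that it runs on its own; a `decapture` method that takes the regexp as an argument is an assumption made for illustration, not part of the answer.

```ruby
# Hypothetical standalone helper using the same substitution as the answer.
# Regexp#to_s keeps the option flags, e.g. "(?-mix:(a)bc)".
def decapture(regexp)
  Regexp.new(regexp.to_s.gsub(/\(\?<\w+>|(?<![^\\]\\)\((?!\?)/, '(?:'))
end

simple  = /(a)bc/
complex = /(#{decapture(simple)}) - (#{decapture(simple)})/

# Only the two outer groups written in complex should capture now.
captures = complex.match("abc - abc").captures
p captures
```

The inner group of simple becomes (?:a), so only the groups written in complex are left to capture.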

Upvotes: 1

Dane O'Connor

Reputation: 77278

This is harder than I thought. Rather than spin my wheels any longer, I changed one requirement and everything became easy: instead of trying to replace every capture group, replace only named capture groups.

Thanks @JustinMorgan and @TimPietzcker for getting me this far.

This is what I've come up with:

class Regexp
  # replaces all named capture groups with non-capturing groups
  # in other words, it replaces all (?<*>...) with (?:...)
  def without_named_captures
    named_captures = %r{\(\?<[^>]+>}
    pattern = self.source.gsub(named_captures, "(?:")
    Regexp.new(pattern)
  end
end

Which passes this spec:

describe "Regexp Extensions" do
  describe "#without_named_captures" do
    it "should replace named captures with non-captures" do
      p1 = /(?<a>.*) - (?<b>.*)/
      p2 = p1.without_named_captures

      p2.should == /(?:.*) - (?:.*)/

      # sanity check
      p1.match('abc - def').should have_exactly(3).items
      p2.match('abc - def').should have_exactly(1).items
    end
  end
end

Dealing with recursion, escaping, and all the other junk just goes away when the token is more complex than a single '('. If I use named captures everywhere, I can use this method. If I don't, things behave normally.

It's late, so I don't know if I'm missing anything, but I think this'll work.

Thanks for the help everyone.
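For the example in the question, the named-capture approach could be sketched roughly as follows. Several details here are assumptions and not part of the answer: the helper name `strip_named_captures`, passing `regexp.options` along so flags such as /i survive, and matching the group name with `\w+` instead of `[^>]+` so that a lookbehind followed later by a `>` (as in `(?<=a)b>`) is not mistaken for a named group.

```ruby
# Hypothetical helper: turn (?<name>...) into (?:...) and keep the flags.
def strip_named_captures(regexp)
  # \w+ only accepts a group name, so (?<=...) and (?<!...) are left alone.
  Regexp.new(regexp.source.gsub(/\(\?<\w+>/, "(?:"), regexp.options)
end

simple  = /(?<first>a)bc/i
complex = /(#{strip_named_captures(simple)}) - (#{strip_named_captures(simple)})/

# complex has no named groups left, so its two numbered groups capture.
captures = complex.match("ABC - abc").captures
p captures

# A lookbehind followed by ">" is untouched.
p strip_named_captures(/(?<=a)b>/).source
```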

Upvotes: 1

user557597

Reputation:

You can switch all the capture groups in a particular regex to non-capturing.
I don't know the Ruby regex flavor very well, but you should get the gist
from this Perl example. I wrote a superset regex to graphically annotate capture
buffers within a regex.

This is the light version, generalized for common regex notation.
It does a global replacement with a callback, testing a capture buffer
to determine which type of match we have.

Sorry if this is slightly complex.

Edit: Note that this was originally used as an annotation regex with a global
search and no replacement. Converting to non-capture groups could undermine the
original regex's intent when it comes to backreferences to unnamed capture groups.

use strict;
use warnings;

#
 my $rxgroup = qr/
    (?:
        (?<!\\)   # Not an escape behind us

        ( (?:\\.)* )  ## CaptGRP 1 - 0 or more "escape + any char"

        ( ## CaptGRP 2

             # Exclude character class'
              \[
                 \]?
                 (?: \\.| \[:[a-z]*:\] | [^\]\n] )*
                 \n?
                 (?: \\.| \[:[a-z]*:\] | [^\]] )*
              \]
           |
             (?# Exclude extended comments )
               \(\?\# [^)]* \)
           |
             # Exclude free comments
              \# [^\n]*

           |
             # Start of a literal capture group
             ( \(  )      ## CaptGRP 3
              (?:
                  (?!\?)    # unnamed: not a ? in front of us

                ## block for annotation only  
                ##  |           # or (Perl 5.10 and above)
                ##              # named: a ?<name> or ?'name' is ok
                ##    (?= \?[<'][^\W\d][\w]*['>] )
              )
        )
     )
  /x;

#
 my @samples = (

  qr/ \(\$th(\\(?:.) [(] \\\\(.\)\\\(.)(i(s))\t(i(s)) ] )/x,
  qr/
     \(\$th(\\(?:.) [(]
     (?# Extended lines
         of comment
     )
     \\\\(.\)\\\(.)(i(s))\t(i(s)) ] )
    /x,
  $rxgroup
 );

#
 for (@samples)
 {
    print "\n\n", '='x20, "\nold: \n\n$_\n\n", '-'x10, "\n";
    s/$rxgroup/ defined $3 ? "$1(?:" : "$1$2" /eg;
    print "new: \n\n$_\n";
 }

Output:

====================
old:

(?x-ism: \(\$th(\\(?:.) [(] \\\\(.\)\\\(.)(i(s))\t(i(s)) ] ))

----------
new:

(?x-ism: \(\$th(?:\\(?:.) [(] \\\\(?:.\)\\\(.)(?:i(?:s))\t(?:i(?:s)) ] ))


====================
old:

(?x-ism:
     \(\$th(\\(?:.) [(]
     (?# Extended lines
         of comment
     )
     \\\\(.\)\\\(.)(i(s))\t(i(s)) ] )
    )

----------
new:

(?x-ism:
     \(\$th(?:\\(?:.) [(]
     (?# Extended lines
         of comment
     )
     \\\\(?:.\)\\\(.)(?:i(?:s))\t(?:i(?:s)) ] )
    )


====================
old:

(?x-ism:
    (?:
        (?<!\\)   # Not an escape behind us

        ( (?:\\.)* )  ## CaptGRP 1 - 0 or more "escape + any char"

        ( ## CaptGRP 2

             # Exclude character class'
              \[
                 \]?
                 (?: \\.| \[:[a-z]*:\] | [^\]\n] )*
                 \n?
                 (?: \\.| \[:[a-z]*:\] | [^\]] )*
              \]
           |
             (?# Exclude extended comments )
               \(\?\# [^)]* \)
           |
             # Exclude free comments
              \# [^\n]*

           |
             # Start of a literal capture group
             ( \(  )      ## CaptGRP 3
              (?:
                  (?!\?)    # unnamed: not a ? in front of us

                ## block for annotation only
                ##  |           # or (Perl 5.10 and above)
                ##              # named: a ?<name> or ?'name' is ok
                ##    (?= \?[<'][^\W\d][\w]*['>] )
              )
        )
     )
  )

----------
new:

(?x-ism:
    (?:
        (?<!\\)   # Not an escape behind us

        (?: (?:\\.)* )  ## CaptGRP 1 - 0 or more "escape + any char"

        (?: ## CaptGRP 2

             # Exclude character class'
              \[
                 \]?
                 (?: \\.| \[:[a-z]*:\] | [^\]\n] )*
                 \n?
                 (?: \\.| \[:[a-z]*:\] | [^\]] )*
              \]
           |
             (?# Exclude extended comments )
               \(\?\# [^)]* \)
           |
             # Exclude free comments
              \# [^\n]*

           |
             # Start of a literal capture group
             (?: \(  )      ## CaptGRP 3
              (?:
                  (?!\?)    # unnamed: not a ? in front of us

                ## block for annotation only
                ##  |           # or (Perl 5.10 and above)
                ##              # named: a ?<name> or ?'name' is ok
                ##    (?= \?[<'][^\W\d][\w]*['>] )
              )
        )
     )
  )

Upvotes: 0

Justin Morgan

Reputation: 30695

Well, the best way to do this would be to create two versions of "simple", but since you indicated you don't want to do that, you could try running "simple" through this regex:

/\((?!\?)/

...and replacing whatever matches that with (?:. However, I want to emphasize that trying to process regex with regex makes me very nervous. I can't guarantee the above pattern won't produce false positives, depending on what you feed into it.

I know it won't properly handle an escaped open-parenthesis (that is, \( meant to be interpreted as a literal ( character). You can mitigate that by using /(^|[^\\])\((?!\?)/ instead, and replacing it with $1(?:, but that will produce false negatives if the backslash itself is escaped (i.e. \\( meant to be interpreted as a literal backslash and the start of a group).

The real solution to this would be something like /(?<!(^|[^\\])(\\\\)*\\)\((?!\?)/ to check for an odd-numbered string of backslashes, but since Ruby 1.8 doesn't support lookbehinds (and later versions reject variable-length ones like this), I'd say go with /(^|[^\\])\((?!\?)/ or whatever seems most sane to you.
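One way to get the odd-number-of-backslashes check without any lookbehind is to let the substitution consume escape pairs itself, so that a "(" is only examined when no backslash is escaping it. The sketch below is a rough illustration under that assumption; the helper name `without_numbered_captures` is hypothetical, and it deliberately ignores character classes such as [(], comments and named groups. Consuming tokens this way also avoids a side effect of the (^|[^\\]) prefix, which swallows the preceding character and therefore skips the second of two adjacent parentheses, as in ((a)).

```ruby
# Hypothetical helper: rewrite unescaped "(" (not followed by "?") as "(?:".
def without_numbered_captures(regexp)
  # Alternation order matters: "\\." eats a backslash plus the character it
  # escapes, so only an unescaped "(" can reach the second alternative.
  source = regexp.source.gsub(/\\.|\((?!\?)/m) do |token|
    token == "(" ? "(?:" : token
  end
  Regexp.new(source, regexp.options)
end

simple   = /(a)bc/
complex  = /(#{without_numbered_captures(simple)}) - (#{without_numbered_captures(simple)})/
captures = complex.match("abc - abc").captures
p captures

escaped_paren     = without_numbered_captures(/\(a\)/) # literal parentheses stay as they are
escaped_backslash = without_numbered_captures(/\\(a)/) # literal backslash, then a real group
nested            = without_numbered_captures(/((a))/) # adjacent parentheses are both rewritten
p escaped_paren, escaped_backslash, nested
```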

Upvotes: 5

derp

Reputation: 3020

Well, I don't know in which cases this would fail but it's my try:

class MatchData
    alias_method :captures_old, :captures
    def captures(other = false)
        unless other
            self.captures_old
        else
            self.captures_old - other.match(self.to_s).captures_old
        end
    end
end

#example
basic = /(a)/
simple = /#{basic}b(c)/
complex = /(#{simple}) - (#{simple})/

#usual behavior
p basic.match("abc - abc").captures
p simple.match("abc - abc").captures
p complex.match("abc - abc").captures
#removes those from simple which also contain those from basic
p complex.match("abc - abc").captures(simple)
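One case where this seems to go wrong is when an outer capture has the same text as an inner one, because Array#- removes every element that equals something in the other array. The sketch below reproduces the subtraction with plain arrays instead of the patched MatchData#captures, so the variable names are illustrative only.

```ruby
simple  = /(a)/
complex = /(#{simple}) - (#{simple})/

all_captures   = complex.match("a - a").captures # outer and inner groups together
inner_captures = simple.match("a - a").captures  # what simple captures by itself

# Every "a" is removed, including the two outer captures we wanted to keep.
remaining = all_captures - inner_captures
p all_captures, inner_captures, remaining
```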

Upvotes: 1
