Reputation: 77278
I have a regex 'simple' that I'd like to use as a building block for another regex 'complex'. The trouble is, the capture groups in 'simple' are interfering with 'complex'. These low-level capture groups are details I don't care about. I'd love to remove them before the regex is consumed.
The question is: how?
Put another way, in code, this isn't working well:
simple = /(a)bc/
complex = /(#{simple}) - (#{simple})/
complex.match("abc - abc").captures # => ["abc", "a", "abc", "a"]
# when I need ["abc", "abc"]
I'd much rather write:
simple = /(a)bc/
complex = /(#{simple.without_capture}) - (#{simple.without_capture})/
complex.match("abc - abc").captures # => ["abc", "abc"]
I'm stuck on how to do this, but I'm betting it's been done before. The implementation of Regexp#without_capture would of course need to account for non-capturing groups, lookahead/lookbehind, etc., so simply removing all the () isn't enough. Also, finding the matching ) for each capture group seems a little challenging.
Thoughts?
EDIT: I forgot to mention that I don't want to manually create two versions of simple (one capturing and one non-capturing). In my actual case it would be impractical to maintain both versions. It'd be much better to be able to toggle the capturing dynamically.
Upvotes: 5
Views: 16368
Reputation: 3454
I know this is an old question, but I've written a refinement for a parser project that has many expressions with capture groups for scanning, each of which needs an identical non-capturing counterpart for splitting.
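# refine must be called inside a module; the module name here is arbitrary
module RegexpDecapture
  refine Regexp do
    def decapture
      Regexp.new(to_s.gsub(/\(\?<\w+>|(?<![^\\]\\)\((?!\?)/, '(?:'))
    end
  end
end
using RegexpDecapture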
It works for capture groups as well as named capture groups, honors expression options, special groups and literal backslash/parenthesis couples. Here are the tests (Ruby 2.5):
describe :decapture do
  it "should replace capture groups with non-capture groups" do
    /(foo) baz (bar)/.decapture.must_equal /(?-mix:(?:foo) baz (?:bar))/
    /(foo) baz (bar)/i.decapture.must_equal /(?i-mx:(?:foo) baz (?:bar))/
  end
  it "should replace named capture groups with non-capture groups" do
    /(?<a>foo) baz (?<b>bar)/.decapture.must_equal /(?-mix:(?:foo) baz (?:bar))/
    /(?<a>foo) baz (?<b>bar)/i.decapture.must_equal /(?i-mx:(?:foo) baz (?:bar))/
  end
  it "should not replace special groups" do
    /(?:foo) (?<=baz) bar/.decapture.must_equal /(?-mix:(?:foo) (?<=baz) bar)/
  end
  it "should not replace literal round brackets" do
    /\(foo\)/.decapture.must_equal /(?-mix:\(foo\))/
  end
  it "should replace literal backslash followed by literal round brackets" do
    /\\(foo\\)/.decapture.must_equal /(?-mix:\\(?:foo\\))/
  end
end
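To tie this back to the question, here is a minimal sketch of the same substitution applied to the original simple/complex example. It uses a plain method instead of a refinement so it runs as a single script; the method name decapture_regexp is made up for illustration and is not part of Ruby's Regexp API.

```ruby
# Hypothetical stand-alone helper (assumption: a plain method, not a refinement).
# It applies the same substitution as the decapture refinement.
def decapture_regexp(regexp)
  # to_s keeps the option flags, e.g. "(?i-mx:...)", so they survive the rebuild
  Regexp.new(regexp.to_s.gsub(/\(\?<\w+>|(?<![^\\]\\)\((?!\?)/, "(?:"))
end

simple  = /(a)bc/
inner   = decapture_regexp(simple)
complex = /(#{inner}) - (#{inner})/

p complex.match("abc - abc").captures
```

Only the two outer groups written in complex capture here, so the inner (a) no longer shows up in the result. Note that this substitution does not look inside character classes, so a pattern like /[(]/ would still be rewritten.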
Upvotes: 1
Reputation: 77278
This is harder than I thought. Rather than spin my wheels any further, I found that relaxing one requirement makes everything easy: instead of trying to replace every capture group, replace only named capture groups.
Thanks @JustinMorgan and @TimPietzcker for getting me this far.
This is what I've come up with:
class Regexp
  # replaces all named capture groups with non-capturing groups
  # in other words, it replaces all (?<*>...) with (?:...)
  def without_named_captures
    named_captures = %r{\(\?<[^>]+>}
    pattern = self.source.gsub(named_captures, "(?:")
    # pass options along so flags such as /i are not lost
    Regexp.new(pattern, options)
  end
end
Which passes this spec:
describe "Regexp Extensions" do
  describe "#without_named_captures" do
    it "should replace named captures with non-captures" do
      p1 = /(?<a>.*) - (?<b>.*)/
      p2 = p1.without_named_captures
      p2.should == /(?:.*) - (?:.*)/
      # sanity check
      p1.match('abc - def').should have_exactly(3).items
      p2.match('abc - def').should have_exactly(1).items
    end
  end
end
Dealing with recursion, escaping, and all the other junk just goes away when the token is more complex than a single '('. If I use named captures everywhere, I can use this method. If I don't, things behave normally.
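As a rough sketch of how this plays out on the simple/complex example from the question: the helper below is a stand-alone variant, not the monkey patch above. Its name, the group name letter, and the stricter \w+ name pattern are all assumptions made for illustration. The stricter pattern is there so that lookbehinds such as (?<=...) are never mistaken for a named group.

```ruby
# Hypothetical stand-alone variant of without_named_captures.
# Assumption: group names consist of word characters only, so \w+ is used
# instead of [^>]+ to avoid touching lookbehinds like (?<=a) or (?<!a).
def strip_named_captures(regexp)
  Regexp.new(regexp.source.gsub(/\(\?<\w+>/, "(?:"), regexp.options)
end

simple  = /(?<letter>a)bc/          # named capture in the building block
inner   = strip_named_captures(simple)
complex = /(#{inner}) - (#{inner})/ # only these two groups should capture

p complex.match("abc - abc").captures
```

The building block keeps its named capture when used on its own, and loses it only when embedded in the larger expression.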
It's late, so I don't know if I'm missing anything, but I think this'll work.
Thanks for the help everyone.
Upvotes: 1
Reputation:
You can switch all the capture groups in a particular regex to non-capture.
I don't really know the Ruby regex flavor very well, but you should get the gist
with this Perl example. I wrote a superset regex to graphically annotate capture
buffers within a regex.
This is the light version, generalized for common regex notation.
It does a global replacement with a callback, testing a capture buffer
to determine which type of match we have.
Sorry if this is slightly complex.
Edit: Note that this was originally used as an annotation regex using a global
search and no replacement. Converting to non-capture groups could undermine the
original regex's intent when it comes to back references to non-named capture groups.
use strict;
use warnings;
#
my $rxgroup = qr/
(?:
(?<!\\) # Not an escape behind us
( (?:\\.)* ) ## CaptGRP 1 - 0 or more "escape + any char"
( ## CaptGRP 2
# Exclude character class'
\[
\]?
(?: \\.| \[:[a-z]*:\] | [^\]\n] )*
\n?
(?: \\.| \[:[a-z]*:\] | [^\]] )*
\]
|
(?# Exclude extended comments )
\(\?\# [^)]* \)
|
# Exclude free comments
\# [^\n]*
|
# Start of a literal capture group
( \( ) ## CaptGRP 3
(?:
(?!\?) # unnamed: not a ? in front of us
## block for annotation only
## | # or (Perl 5.10 and above)
## # named: a ?<name> or ?'name' is ok
## (?= \?[<'][^\W\d][\w]*['>] )
)
)
)
/x;
#
my @samples = (
qr/ \(\$th(\\(?:.) [(] \\\\(.\)\\\(.)(i(s))\t(i(s)) ] )/x,
qr/
\(\$th(\\(?:.) [(]
(?# Extended lines
of comment
)
\\\\(.\)\\\(.)(i(s))\t(i(s)) ] )
/x,
$rxgroup
);
#
for (@samples)
{
print "\n\n", '='x20, "\nold: \n\n$_\n\n", '-'x10, "\n";
s/$rxgroup/ defined $3 ? "$1(?:" : "$1$2" /eg;
print "new: \n\n$_\n";
}
Output:
====================
old:
(?x-ism: \(\$th(\\(?:.) [(] \\\\(.\)\\\(.)(i(s))\t(i(s)) ] ))
----------
new:
(?x-ism: \(\$th(?:\\(?:.) [(] \\\\(?:.\)\\\(.)(?:i(?:s))\t(?:i(?:s)) ] ))
====================
old:
(?x-ism:
\(\$th(\\(?:.) [(]
(?# Extended lines
of comment
)
\\\\(.\)\\\(.)(i(s))\t(i(s)) ] )
)
----------
new:
(?x-ism:
\(\$th(?:\\(?:.) [(]
(?# Extended lines
of comment
)
\\\\(?:.\)\\\(.)(?:i(?:s))\t(?:i(?:s)) ] )
)
====================
old:
(?x-ism:
(?:
(?<!\\) # Not an escape behind us
( (?:\\.)* ) ## CaptGRP 1 - 0 or more "escape + any char"
( ## CaptGRP 2
# Exclude character class'
\[
\]?
(?: \\.| \[:[a-z]*:\] | [^\]\n] )*
\n?
(?: \\.| \[:[a-z]*:\] | [^\]] )*
\]
|
(?# Exclude extended comments )
\(\?\# [^)]* \)
|
# Exclude free comments
\# [^\n]*
|
# Start of a literal capture group
( \( ) ## CaptGRP 3
(?:
(?!\?) # unnamed: not a ? in front of us
## block for annotation only
## | # or (Perl 5.10 and above)
## # named: a ?<name> or ?'name' is ok
## (?= \?[<'][^\W\d][\w]*['>] )
)
)
)
)
----------
new:
(?x-ism:
(?:
(?<!\\) # Not an escape behind us
(?: (?:\\.)* ) ## CaptGRP 1 - 0 or more "escape + any char"
(?: ## CaptGRP 2
# Exclude character class'
\[
\]?
(?: \\.| \[:[a-z]*:\] | [^\]\n] )*
\n?
(?: \\.| \[:[a-z]*:\] | [^\]] )*
\]
|
(?# Exclude extended comments )
\(\?\# [^)]* \)
|
# Exclude free comments
\# [^\n]*
|
# Start of a literal capture group
(?: \( ) ## CaptGRP 3
(?:
(?!\?) # unnamed: not a ? in front of us
## block for annotation only
## | # or (Perl 5.10 and above)
## # named: a ?<name> or ?'name' is ok
## (?= \?[<'][^\W\d][\w]*['>] )
)
)
)
)
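The same callback idea can be sketched in Ruby with gsub and a block. This is a minimal sketch under stated assumptions, not a port of the Perl above: the method name decapture_source is made up for illustration, and it only skips escape pairs and simple (non-nested) character classes. It does not recognise extended-mode comments.

```ruby
# Hypothetical helper, not part of Ruby's Regexp API.
# The pattern has three alternatives, tried left to right at each position:
#   1. an escape pair (backslash plus any character)  -> copied unchanged
#   2. a simple character class in square brackets    -> copied unchanged
#   3. a bare "(" not followed by "?" (group 1)       -> rewritten to "(?:"
def decapture_source(source)
  token = /\\.|\[(?:\\.|[^\]\\])*\]|(\((?!\?))/m
  source.gsub(token) do
    # Group 1 is only set when the third alternative matched.
    Regexp.last_match(1) ? "(?:" : Regexp.last_match(0)
  end
end

simple  = /(a)bc/
inner   = Regexp.new(decapture_source(simple.source), simple.options)
complex = /(#{inner}) - (#{inner})/

p complex.match("abc - abc").captures
```

Because escapes and character classes are consumed as whole tokens, a parenthesis inside them is never seen by the third alternative. As noted above, rewriting groups this way can break numbered back references in the original pattern.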
Upvotes: 0
Reputation: 30695
Well, the best way to do this would be to create two versions of "simple", but since you indicated you don't want to do that, you could try running "simple" through this regex:
/\((?!\?)/
...and replacing whatever matches that with (?:. However, I want to emphasize that trying to process regex with regex makes me very nervous. I can't guarantee the above pattern won't produce false positives, depending on what you feed into it.
I know it won't properly handle an escaped open parenthesis (that is, \( meant to be interpreted as a literal ( character). You can mitigate that by using /(^|[^\\])\((?!\?)/ instead and replacing it with $1(?:, but that will produce false negatives if the backslash itself is escaped (i.e. \\( meant to be interpreted as a literal backslash followed by the start of a group).
The real solution to this would be something like /(?<!(^|[^\\])(\\\\)*\\)\((?!\?)/ to check for an odd-numbered run of backslashes, but since Ruby 1.8 doesn't support lookbehind at all (and later versions reject a variable-length lookbehind like this one), I'd say go with /(^|[^\\])\((?!\?)/ or whatever seems most sane to you.
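One way to sidestep the odd/even backslash problem without lookbehind is to walk the pattern source one character at a time and consume each escape as a pair. The following is a minimal sketch under stated assumptions: the method name strip_unnamed_captures is hypothetical, named groups are left alone (as with the patterns above), and extended-mode comments are not handled.

```ruby
# Hypothetical helper: rewrites unnamed capture groups as non-capturing groups
# by scanning the pattern source left to right.
def strip_unnamed_captures(source)
  out = String.new
  class_depth = 0 # greater than zero while inside a [...] character class
  i = 0
  while i < source.length
    ch = source[i]
    if ch == "\\"
      # An escape always consumes the next character too, so "\(" stays a
      # literal paren while "\\(" is a literal backslash plus a real group.
      out << source[i, 2]
      i += 2
      next
    elsif ch == "["
      class_depth += 1
    elsif ch == "]" && class_depth > 0
      class_depth -= 1
    elsif ch == "(" && class_depth.zero? && source[i + 1] != "?"
      out << "(?:"
      i += 1
      next
    end
    out << ch
    i += 1
  end
  out
end

simple  = /(a)bc/
inner   = Regexp.new(strip_unnamed_captures(simple.source), simple.options)
complex = /(#{inner}) - (#{inner})/

p complex.match("abc - abc").captures
```

Since every backslash is consumed together with the character after it, there is never a need to count how many backslashes precede a parenthesis.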
Upvotes: 5
Reputation: 3020
Well, I don't know in which cases this would fail, but here's my attempt:
class MatchData
  alias_method :captures_old, :captures
  def captures(other = false)
    if other
      self.captures_old - other.match(self.to_s).captures_old
    else
      self.captures_old
    end
  end
end
# example
basic = /(a)/
simple = /#{basic}b(c)/
complex = /(#{simple}) - (#{simple})/
# usual behavior
p basic.match("abc - abc").captures
p simple.match("abc - abc").captures
p complex.match("abc - abc").captures
# removes those from simple which also contain those from basic
p complex.match("abc - abc").captures(simple)
Upvotes: 1