Reputation: 73
Looking at the code below (in Ruby), why does the first instance reject the spaces when forming an array from the string while in the second instance it keeps the space? Basically, I am asking what the difference between /.../ and /(...)/ is in regex.
str = 'The quick brown fox jumps over the lazy dog.'
print str.split(/\s+/)
#["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog."]
print str.split(/(\s+)/)
#["The", " ", "quick", " ", "brown", " ", "fox", " ", "jumps", " ", "over", " ", "the", " ", "lazy", " ", "dog."]
Upvotes: 1
Views: 115
Reputation: 95252
Normally, parentheses around part of a regular expression capture the part of the text that matched that part of the regex, for use later in the regex or in the code after the match.
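Example: "foofoo".match(/(.*)\1/) returns a MatchData object (which is truthy) covering the whole string, because the string consists of a repeated substring; the \1 means "whatever matched the part of this pattern between its first set of parentheses". In this case the .* between the parentheses matches "foo" and the \1 matches the second copy of "foo". (Strictly, .* can also match the empty string, so this pattern matches any string at all; use (.+)\1 to test for an actual repetition.)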
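As a small sketch of the capture and backreference behavior described above: the pattern here uses (.+) rather than (.*), so that a string with no repetition really fails to match, and the variable names are only illustrative.

```ruby
# \1 must repeat exactly what the first parenthesized group captured.
m = "foofoo".match(/(.+)\1/)
whole = m[0]      # the entire matched text
captured = m[1]   # the text captured by the first group
puts whole        # foofoo
puts captured     # foo

# "abc" contains no repeated substring, so match returns nil.
no_match = "abc".match(/(.+)\1/)
puts no_match.inspect   # nil
```

The captured text stays available after the match through m[1], which is the "for use in the code after the match" part.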
But split is different; it matches many times and returns only an array of pieces, so it has no separate place to put the text captured by parenthesized groups. So instead, the parentheses tell it to include the captured text in the result array, between the pieces it separated. That way, if the regex being split on can match more than one kind of delimiter, you can find out which one was in each place.
# no capture, nothing kept; nothing in the array to tell - from ,
'1-2,3'.split(/[-,]/) #=> ["1", "2", "3"]
# with capture, include delimiters; now array tells me which separators used
'1-2,3'.split(/([-,])/) #=> ["1", "-", "2", ",", "3"]
This is another example of a behavior inherited from Perl, FWIW.
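A short sketch that puts the variants side by side may help. The non-capturing group (?:...) and the padded input string are additions for illustration, not part of the question; they show that only capturing parentheses change what split returns, and that only the captured part of each delimiter is kept.

```ruby
s = '1-2,3'

plain    = s.split(/[-,]/)      # no group: delimiters are dropped
captured = s.split(/([-,])/)    # capturing group: delimiters are kept
grouped  = s.split(/(?:-|,)/)   # non-capturing group: dropped again

# Only the captured part is kept, not the whole delimiter match:
# the surrounding spaces are matched but sit outside the parentheses.
partial = '1 - 2 , 3'.split(/\s*([-,])\s*/)

p plain      # ["1", "2", "3"]
p captured   # ["1", "-", "2", ",", "3"]
p grouped    # ["1", "2", "3"]
p partial    # ["1", "-", "2", ",", "3"]
```

If parentheses are needed purely for grouping (for alternation, say), (?:...) avoids the extra entries in the array.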
Upvotes: 2
Reputation: 521093
I think you want an explanation of the difference between:
print str.split(/\s+/) # split on whitespace and consume it
print str.split(/(\s+)/) # split on whitespace and capture it
In the first case, we are splitting on whitespace, and those whitespace characters are consumed, meaning that they will not appear in the output array. This leaves behind only the non-whitespace words. In the second version, on the other hand, we make the splitting whitespace a capture group by using (\s+). As a result, the captured whitespace makes it into the output, and we see the non-whitespace runs and the whitespace runs as separate, in-order entries in the output array.
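One practical consequence, sketched here with a shortened sample string (an assumption, chosen to include a double space): because the captured whitespace is preserved in order, joining the pieces gives back the original string, which the consuming form cannot do.

```ruby
str = 'The quick  brown fox'   # note the double space after "quick"

parts = str.split(/(\s+)/)
p parts     # ["The", " ", "quick", "  ", "brown", " ", "fox"]

# Nothing was consumed, so the pieces reassemble into the original string.
rebuilt = parts.join
p rebuilt == str   # true

# Words sit at the even indexes and separators at the odd ones
# (for a string that does not start with whitespace).
words = parts.each_slice(2).map(&:first)
p words     # ["The", "quick", "brown", "fox"]
```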
Upvotes: 3