Reputation: 2165
I'm trying to extract the data between a start marker and an end marker in a string. There are multiple matches and I need to extract all of them (into an array or a list, it doesn't matter).
I have a limitation and cannot use the regex Matcher in my setup, so as an alternative I'm looking at using string.split() with a regex.
def str = "USELESS STUFF START:M A:STUFF1 B:MORE2 C:THAT3 END:M START:M A:STUFF4 B:MORE5 C:THAT6 END:M START:M A:STUFF7 B:MORE8 C:THAT9 END:M USELESS STUFF"
This pattern works with Regex Matcher and extracts all the matches between the starting and ending marker.
def items = str =~ /(?s)(?<=START:M).*?(?=END:M)/
Result:
[ A:STUFF1 B:MORE2 C:THAT3, A:STUFF4 B:MORE5 C:THAT6, A:STUFF7 B:MORE8 C:THAT9 ]
However, when I try to use the same pattern on string.split
def items = str.split(/(?s)(?<=START:M).*?(?=END:M)/)
it returns the end and start markers themselves for each match instead of what's between them.
[USELESS STUFF START:M, END:M START:M, END:M START:M, END:M USELESS STUFF]
What am I missing? Why doesn't split with this pattern return the same groups as the Matcher does?
Upvotes: 0
Views: 548
Reputation: 2599
This behavior corresponds well to the method names: a matcher answers "match what text?", while split answers "split by what separator?".
What Groovy does in this case is essentially pour some syntactic sugar over the standard Java APIs. The line
def items = str =~ /(?s)(?<=START:M).*?(?=END:M)/
is the same as
Matcher items = Pattern.compile("(?s)(?<=START:M).*?(?=END:M)").matcher(str);
The matches found by this matcher will be
A:STUFF1 B:MORE2 C:THAT3
A:STUFF4 B:MORE5 C:THAT6
A:STUFF7 B:MORE8 C:THAT9
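For reference, here is a minimal Java sketch of that matcher loop. The class name MatcherDemo and the helper findAll are made up for illustration; only Pattern and Matcher come from the JDK. The raw matches include the spaces next to the markers, so the sketch trims them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherDemo {
    // Same pattern as in the question: lazily match whatever sits
    // between a START:M behind and an END:M ahead.
    static final Pattern BETWEEN = Pattern.compile("(?s)(?<=START:M).*?(?=END:M)");

    // Hypothetical helper: collects every match, trimmed of surrounding spaces.
    static List<String> findAll(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = BETWEEN.matcher(text);
        while (m.find()) {
            matches.add(m.group().trim());
        }
        return matches;
    }

    public static void main(String[] args) {
        String str = "USELESS STUFF START:M A:STUFF1 B:MORE2 C:THAT3 END:M "
                + "START:M A:STUFF4 B:MORE5 C:THAT6 END:M "
                + "START:M A:STUFF7 B:MORE8 C:THAT9 END:M USELESS STUFF";
        // Prints the three payloads, one per line.
        findAll(str).forEach(System.out::println);
    }
}
```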
While the Matcher returns the matches, split does the opposite and splits by them: it finds the parts of the text that match the given regex, treats them as separators, cuts them out and returns what's left:
USELESS STUFF START:M
// A:STUFF1 B:MORE2 C:THAT3 is cut out since it's a separator
END:M START:M
// A:STUFF4 B:MORE5 C:THAT6 is a separator
END:M START:M
// A:STUFF7 B:MORE8 C:THAT9 is a separator
END:M USELESS STUFF
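The same effect can be reproduced in plain Java, since Groovy's split here is just String.split. This is only a sketch: the class name SplitDemo and the helper splitOnPayload are hypothetical, and the pattern is the one from the question.

```java
import java.util.Arrays;
import java.util.List;

public class SplitDemo {
    // Hypothetical helper: uses the question's pattern as the SEPARATOR,
    // so the payload between the markers is exactly what gets thrown away.
    static List<String> splitOnPayload(String text) {
        return Arrays.asList(text.split("(?s)(?<=START:M).*?(?=END:M)"));
    }

    public static void main(String[] args) {
        String str = "USELESS STUFF START:M A:STUFF1 B:MORE2 C:THAT3 END:M "
                + "START:M A:STUFF4 B:MORE5 C:THAT6 END:M "
                + "START:M A:STUFF7 B:MORE8 C:THAT9 END:M USELESS STUFF";
        // Prints:
        // [USELESS STUFF START:M, END:M START:M, END:M START:M, END:M USELESS STUFF]
        System.out.println(splitOnPayload(str));
    }
}
```

The lookbehind and lookahead are zero-width, so the markers themselves are never part of the separator; that is why they survive in the output.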
To actually get the data between the START and END markers, str.split(" END:M START:M | START:M | END:M ") would do, although the result still contains the useless stuff as its first and last elements. The standard String methods indexOf, lastIndexOf and substring can be very helpful here: they get rid of the useless stuff and leave only the needed groups by simply removing all content before the first START:M and after the last END:M:
str.substring(str.indexOf("START:M ") + 8, str.lastIndexOf(" END:M"))
.split(" END:M START:M ")
// or more groovy
str[str.indexOf("START:M ") + 8 .. str.lastIndexOf(" END:M") - 1]
.split(" END:M START:M ")
(8 is the length of "START:M ", including the trailing space.)
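If it helps, the substring-then-split idea could be wrapped up roughly like this in plain Java. The class BetweenMarkers, its constants and the extract helper are names I made up for the sketch. It assumes the markers are always separated from the data by single spaces, as in the sample string, and it computes the offset from the marker length instead of hard-coding 8.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

public class BetweenMarkers {
    // Assumed marker spellings, including the single separating spaces.
    static final String START = "START:M ";
    static final String END = " END:M";

    // Hypothetical helper: drops everything before the first START marker and
    // after the last END marker, then splits on the "END START" pairs in between.
    static List<String> extract(String text) {
        int from = text.indexOf(START);
        int to = text.lastIndexOf(END);
        if (from < 0 || to < from + START.length()) {
            // A marker is missing, or the last END comes before the first payload.
            return Collections.emptyList();
        }
        String body = text.substring(from + START.length(), to);
        // Pattern.quote makes split treat the separator as literal text.
        return Arrays.asList(body.split(Pattern.quote(END + " " + START)));
    }

    public static void main(String[] args) {
        String str = "USELESS STUFF START:M A:STUFF1 B:MORE2 C:THAT3 END:M "
                + "START:M A:STUFF4 B:MORE5 C:THAT6 END:M "
                + "START:M A:STUFF7 B:MORE8 C:THAT9 END:M USELESS STUFF";
        // Prints:
        // [A:STUFF1 B:MORE2 C:THAT3, A:STUFF4 B:MORE5 C:THAT6, A:STUFF7 B:MORE8 C:THAT9]
        System.out.println(extract(str));
    }
}
```

Note that this only works when the blocks sit back to back; any text between one END:M and the next START:M would end up inside a group.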
Upvotes: 1