Reputation: 58
I'm trying to capture text (any text) that falls between a pair of delimiters, with word boundaries on each end, like so:
This is not the text. ##This is the text I want to capture.## This is also not the text. ##But I would like to capture this, too##.
I thought this would be easy with a regex like this:
\b([#]{2})(.*)(\1)\b
This doesn't produce a match and I can't figure out why.
Note that I would also like to avoid a single match that runs from the first '##' to the last '##', swallowing both sections and all the text in between.
In other words, I don't want one of the matches to be:
##This is the text I want to capture.## This is also not the text. ##But I would like to capture this, too##
Upvotes: 2
Views: 660
Reputation: 49086
georg and Ulugbek Umirov posted the perfect answer to this question as comments. I repeat the expression here with an explanation, mainly to give the question an answer and so remove it from the list of unanswered questions.
##\b(.+?)##
searches for a string between ## and ##. Because of the parentheses, the string found between the two ## delimiters is captured in a group and available for backreference.
The question mark ? after the + quantifier changes the matching behavior from greedy to non-greedy. The greedy expression .+ matches everything from the first ## to the last ## whereas the non-greedy expression .+? matches only from the first ## to the next ## that follows it.
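The difference between the greedy and the non-greedy form can be checked on the sample sentence from the question. This is a minimal sketch in Python; the thread does not name a regex engine, so the choice of Python's re module and the variable names are assumptions made for illustration.

```python
import re

# Sample sentence taken from the question.
text = (
    "This is not the text. ##This is the text I want to capture.## "
    "This is also not the text. ##But I would like to capture this, too##."
)

# Greedy: .+ runs to the end of the line and backtracks to the LAST ##,
# so both sections and everything between them end up in one capture.
greedy = re.findall(r"##\b(.+)##", text)

# Non-greedy: .+? stops at the NEXT ##, so each section is captured separately.
lazy = re.findall(r"##\b(.+?)##", text)

print(len(greedy), len(lazy))
for captured in lazy:
    print(repr(captured))
```

With the non-greedy form, each delimited section comes back as its own list entry, which is the behavior the question asks for.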
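\b means word boundary. Since # is not a word character, a boundary directly after ## exists only if the next character is a word character (letter, digit or underscore). The same rule explains why the pattern in the question finds nothing: its leading \b demands a word character immediately before the first ## and its trailing \b demands one immediately after the last ##, but in the sample text no pair of ## delimiters has word characters on both of its outer sides.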
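The effect of where \b is placed can be tried out directly. The following is a small sketch, assuming Python's re module as the engine; the probe strings are made up for illustration and do not come from the thread.

```python
import re

text = (
    "This is not the text. ##This is the text I want to capture.## "
    "This is also not the text. ##But I would like to capture this, too##."
)

# Pattern from the question: \b sits OUTSIDE the delimiters, so a word
# character is required right before the opening ## and right after the
# closing ##.
question_pattern = re.compile(r"\b([#]{2})(.*)(\1)\b")
question_match = question_pattern.search(text)

# Hypothetical input where the delimiters are glued to word characters.
glued_match = question_pattern.search("x##abc##y")

# Pattern from the answer: \b sits INSIDE, right after the opening ##,
# so the captured text must start with a word character.
answer_pattern = re.compile(r"##\b(.+?)##")
probe_matches = answer_pattern.findall("##abc## ## spaced ##")

print(question_match)
print(glued_match.group(2))
print(probe_matches)
```

The second delimited section in the probe string starts with a space, so the inner \b rejects it.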
The matching behavior of . depends on a flag (often called single-line or dot-all mode). With the flag set, the dot matches any character including line terminating characters; without it, the dot matches any character except line terminating characters. Line terminating characters are carriage return (= \r = CR) and line feed (= newline = \n = LF).
If matching everything between two delimiter strings should be independent of the matching behavior of the dot, it is better to use the regular expression ##\b([\w\W]+?)## as Ulugbek Umirov suggested. \w matches any word character and \W matches any non-word character, so a character class containing both always matches any character, including CR and LF.
It would also be possible to use ##\b([\s\S]+?)## where \s matches any whitespace character and \S matches any non-whitespace character. Together in one character class they likewise match any character, including CR and LF.
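How the dot and the two character-class alternatives behave once the delimited text spans more than one line can be compared side by side. This is a sketch assuming Python's re module, where the dot-all flag is spelled re.DOTALL; the two-line sample string is invented for the example.

```python
import re

# Hypothetical input whose delimited section contains a line feed.
multiline = "before ##first line\nsecond line## after"

# Default dot: stops at the line feed, so the closing ## is never reached.
dot_default = re.findall(r"##\b(.+?)##", multiline)

# Dot with the dot-all flag: the line feed is matched like any other character.
dot_all = re.findall(r"##\b(.+?)##", multiline, flags=re.DOTALL)

# Character classes that match every character regardless of flags.
word_class = re.findall(r"##\b([\w\W]+?)##", multiline)
space_class = re.findall(r"##\b([\s\S]+?)##", multiline)

print(dot_default)
print(dot_all == word_class == space_class)
```

The character-class forms give the same result as the flagged dot without relying on how the caller configured the engine.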
Further, it would be possible to use ##(\w[\s\S]*?)## or ##\w([\w\W]*?)## or ##(\w.*?)## which all match the same strings as the expressions above, provided the dot matches any character including CR and LF. Note that in the second of these the \w stands outside the parentheses, so the group captures the text without its first character.
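The three \w variants can be compared on the sample sentence from the question to see what each one captures. This is an illustrative sketch assuming Python's re module; the dictionary and its labels are hypothetical names used only here.

```python
import re

text = (
    "This is not the text. ##This is the text I want to capture.## "
    "This is also not the text. ##But I would like to capture this, too##."
)

# The three variants; the labels are only for readability.
variants = {
    "w_inside_class": r"##(\w[\s\S]*?)##",   # \w inside the group
    "w_outside": r"##\w([\w\W]*?)##",        # \w outside the group
    "w_inside_dot": r"##(\w.*?)##",          # \w inside the group, plain dot
}

captures = {name: re.findall(pattern, text) for name, pattern in variants.items()}

for name, found in captures.items():
    print(name, found)
```

On this single-line input the first and third variants capture identical text, while the variant with \w outside the group drops the leading letter of each section from the capture.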
Last, if the regular expression engine in use supports lookbehind and lookahead, it would also be possible to match only the string between the ## delimiters, without matching the delimiters themselves, by using for example the regular expression (?<=##)\b[\w\W]+?(?=##) which makes a capturing group unnecessary. (?<=##) is a positive lookbehind and (?=##) is a positive lookahead, both for the string ##.
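The lookaround form can be tried the same way. The sketch below assumes Python's re module, which supports fixed-width lookbehind such as (?<=##); the short string "##a##b##c##" is a made-up input added to show one behavioral difference from the capturing-group form.

```python
import re

text = (
    "This is not the text. ##This is the text I want to capture.## "
    "This is also not the text. ##But I would like to capture this, too##."
)

lookaround = re.compile(r"(?<=##)\b[\w\W]+?(?=##)")
grouped = re.compile(r"##\b(.+?)##")

# No group needed: each whole match is already the text between the delimiters.
matches = lookaround.findall(text)
print(matches)

# Lookarounds do not consume the delimiters, so a closing ## can serve as the
# opening ## of the next match when a word character follows it directly.
adjacent = "##a##b##c##"
print(lookaround.findall(adjacent))
print(grouped.findall(adjacent))
```

On the sentence from the question both forms return the same two sections, because each closing ## there is followed by a space or a period. When delimited sections sit back to back, the two forms can disagree, so the choice depends on whether text between a closing and the next opening delimiter should count.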
Upvotes: 2