Paul

Reputation: 58

Regex capturing any text between

I'm trying to capture text (any text) that falls between some kind of delimiter with word boundaries on each end, like so:

This is not the text. ##This is the text I want to capture.## This is also not the text. ##But I would like to capture this, too##.

I thought this would be easy with a regex like this:

\b([#]{2})(.*)(\1)\b

This doesn't produce a match and I can't figure out why.

Note that I would also like to avoid a single match that runs from the first '##' to the last '##' and so captures both sections together with all the text in between.

In other words, I don't want one of the matches to be:

##This is the text I want to capture.## This is also not the text. ##But I would like to capture this, too##

Upvotes: 2

Views: 660

Answers (1)

Mofi

Reputation: 49086

georg and Ulugbek Umirov posted the perfect answer to this question as comments. I repeat the expression here with an explanation, mainly to give the question an answer and so remove it from the list of unanswered questions.

##\b(.+?)## searches for a string

  • starting and ending with ##,
  • with a word character directly after the opening ##, and
  • having 1 or more characters between the delimiters.

Because of the parentheses, the string found between the ## delimiters is captured as group 1 and is available for backreference.
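As a quick check, here is a minimal sketch in Python (the question does not name an engine, so the choice of Python's re module is an assumption) that applies the expression to the sample text from the question and prints what the group captures:

```python
import re

# Sample text taken from the question.
text = ("This is not the text. ##This is the text I want to capture.## "
        "This is also not the text. ##But I would like to capture this, too##.")

# With exactly one capturing group, findall returns the group contents only.
captured = re.findall(r"##\b(.+?)##", text)

for item in captured:
    print(item)
```

Each delimited section comes back as its own item, without the surrounding ## characters.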

The question mark ? after the + quantifier changes the matching behavior from greedy to non-greedy. The greedy expression .+ matches everything from the first ## to the last ##, whereas the non-greedy expression .+? matches only from the first ## to the next ##.
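The difference is easy to see side by side. The following sketch, again assuming Python's re module, runs the greedy and the non-greedy variant against the sample text from the question:

```python
import re

# Sample text taken from the question.
text = ("This is not the text. ##This is the text I want to capture.## "
        "This is also not the text. ##But I would like to capture this, too##.")

# Greedy: .+ runs to the end of the line and backtracks to the LAST ##.
greedy = re.findall(r"##\b(.+)##", text)

# Non-greedy: .+? stops at the NEXT ## after each opening ##.
lazy = re.findall(r"##\b(.+?)##", text)

print(len(greedy), "greedy match(es)")
print(len(lazy), "non-greedy match(es)")
```

The greedy variant produces exactly the single long match the question wants to avoid, with the inner ## characters inside the captured text.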

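\b means word boundary. Since # is not a word character, a boundary directly after ## exists only if the next character is a word character (letter, digit or underscore), so the first character after ## must be a word character.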
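The same rule most likely explains why the expression in the question finds nothing: its leading \b stands in front of the ##, so it needs a word character directly before the opening delimiter, and in the sample text the opening ## is preceded by a space. A small sketch, assuming Python's re module, to check both points:

```python
import re

# Sample text taken from the question.
text = ("This is not the text. ##This is the text I want to capture.## "
        "This is also not the text. ##But I would like to capture this, too##.")

# Expression from the question: \b before ## needs a word character
# immediately in front of the delimiter.
original = r"\b([#]{2})(.*)(\1)\b"
original_match = re.search(original, text)

# \b after ## needs a word character immediately after the delimiter.
with_word_char = re.search(r"##\b", "##x")
with_space = re.search(r"##\b", "## x")

print(original_match)
print(with_word_char is not None, with_space is not None)
```

Moving \b to the inside of the delimiters, as in ##\b(.+?)##, puts the boundary between # and the first letter of the text, where it can actually hold.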

The matching behavior of . depends on a flag. The dot can match any character including line terminating characters, or any character except line terminating characters. Line terminating characters are carriage return (= \r = CR) and line feed (= newline = \n = LF).
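How the flag is set depends on the engine. As an illustration, assuming Python's re module, the flag is re.DOTALL, and without it the dot excludes only the line feed \n. The string multi below is made up for this example:

```python
import re

# Hypothetical input with a line break between the delimiters.
multi = "##first\nsecond##"

# Default: the dot does not cross the line feed, so nothing is found.
without_flag = re.findall(r"##\b(.+?)##", multi)

# With DOTALL the dot matches the line feed as well.
with_flag = re.findall(r"##\b(.+?)##", multi, flags=re.DOTALL)

print(without_flag)
print(with_flag)
```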

If matching everything between two delimiter strings should be independent of the matching behavior of the dot, it is better to use the regular expression ##\b([\w\W]+?)## as Ulugbek Umirov suggested. \w matches any word character and \W matches any non-word character, so the two combined in one character class always match any character, including CR and LF.

It would also be possible to use ##\b([\s\S]+?)##, where \s matches any whitespace character and \S matches any non-whitespace character. Combined in one character class, they likewise match any character, including CR and LF.
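A short sketch, assuming Python's re module and a made-up input string, shows that both character classes cross a line break without any flag being set:

```python
import re

# Hypothetical input with a line break between the delimiters.
multi = "##first\nsecond##"

# No flags needed: each class matches every character, line feed included.
word_class = re.findall(r"##\b([\w\W]+?)##", multi)
space_class = re.findall(r"##\b([\s\S]+?)##", multi)

print(word_class == space_class)
```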

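Further, it would be possible to use ##(\w[\s\S]*?)## or ##\w([\w\W]*?)## or ##(\w.*?)##. All of them match the same strings as the expressions above, provided the dot is set to match any character including CR and LF. Note that in ##\w([\w\W]*?)## the leading \w sits outside the parentheses, so group 1 does not contain the first character of the text.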
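The three variants can be compared directly. The sketch below assumes Python's re module and a made-up input string. It looks at the overall match and at group 1 separately, because the position of the parentheses affects only the latter:

```python
import re

# Hypothetical input with a line break between the delimiters.
multi = "##first\nsecond##"

patterns = {
    "a": (r"##(\w[\s\S]*?)##", 0),
    "b": (r"##\w([\w\W]*?)##", 0),      # \w is outside the group here
    "c": (r"##(\w.*?)##", re.DOTALL),   # the dot needs DOTALL for the line feed
}

full = {}    # overall match, delimiters included
group1 = {}  # contents of the capturing group
for name, (pattern, flags) in patterns.items():
    m = re.search(pattern, multi, flags)
    full[name] = m.group(0)
    group1[name] = m.group(1)

print(full["a"] == full["b"] == full["c"])
print(group1)
```

All three agree on the overall match, while the second variant captures the text without its first character.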

Last, if the regular expression engine in use supports lookbehind and lookahead, it would also be possible to match only the string between the ## delimiters, without matching the delimiters themselves, for example with the regular expression (?<=##)\b[\w\W]+?(?=##), which makes a capturing group unnecessary. (?<=##) is a positive lookbehind and (?=##) is a positive lookahead, both for the string ##.
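A sketch of the lookaround variant, assuming Python's re module (which supports fixed-width lookbehind such as this one). It also includes one made-up edge case worth knowing about: because the lookarounds do not consume the delimiters, a closing ## can serve as the opening ## of the next match when a word character follows it directly.

```python
import re

# Sample text taken from the question.
text = ("This is not the text. ##This is the text I want to capture.## "
        "This is also not the text. ##But I would like to capture this, too##.")

lookaround = r"(?<=##)\b[\w\W]+?(?=##)"

# No capturing group, so findall returns the whole match each time.
found = re.findall(lookaround, text)
print(found)

# Hypothetical edge case: delimiters directly followed by word characters.
tight = "##a##b##c##"
with_group = re.findall(r"##\b(.+?)##", tight)
with_lookaround = re.findall(lookaround, tight)
print(with_group, with_lookaround)
```

On the sample text both approaches return the same two strings. On the tight input the lookaround variant also returns the b that sits between the two delimited sections, so the capturing-group form may be the safer choice when such input is possible.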

Upvotes: 2
