Nolan

Reputation: 363

How does the following regex work?

Let's say I have a string that I want to parse from an opening double-quote to a closing double-quote:

asdf"pass\"word"asdf

I was lucky enough to discover that the following PCRE would match from the opening double-quote to the closing double-quote while ignoring the escaped double-quote in the middle (to properly parse the logical unit):

".*?(?:(?!\\").)"

Match:

"pass\"word"

However, I have no idea why this PCRE matches the opening and closing double-quote properly.

I know the following:

" = literal double-quote

.*? = lazy matching of zero or more of any character

(?: = opening of non-capturing group

(?!\\") = asserts it's impossible to match a literal \" at this position

. = single character

) = closing of non-capturing group

" = literal double-quote

It appears that a single character and a negative lookahead are part of the same logical group. To me, this means the PCRE is saying "Match from a double-quote to zero or more of any character as long as there is no \" right after the character, then match one more character and one single double-quote."

However, according to that logic the PCRE would not match the string at all.

Could someone help me wrap my head around this?

Upvotes: 4

Views: 171

Answers (2)

Casimir et Hippolyte

Reputation: 89629

Nothing to add to CrayonViolent's explanation, only a little disambiguation and some ways to match substrings enclosed in double quotes (possibly containing quotes escaped by a backslash).

First, a note on terminology: in your question you use the acronym "PCRE" (Perl Compatible Regular Expressions) where you mean "pattern". PCRE is the name of a particular regex engine (and, by extension and somewhat imprecisely, of its syntax). A pattern is the regular expression itself, which describes a set of strings, whatever regex engine is used.

With Bash:

A='asdf"pass\"word"asdf'
pattern='"(([^"\\]|\\.)*)"'

# Group 1 holds the text between the quotes, escape sequences included
[[ $A =~ $pattern ]] && printf '%s\n' "${BASH_REMATCH[1]}"

You can use this pattern too: pattern='"(([^"\\]+|\\.)*)"'

With a PCRE regex engine, you can use the first pattern, but it's better to rewrite it in a more efficient way:

"([^"\\]*+(?:\\.[^"\\]*)*+)"

Note that these three patterns don't need any lookaround. They can handle any number of consecutive backslashes: "abc\\\"def" (a literal backslash and an escaped quote), "abcdef\\\\" (two literal backslashes, the quote is not escaped).
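As a rough check of how this kind of pattern treats runs of backslashes, here is a minimal sketch in Python (an assumption on my part; the question itself is about PCRE and Bash). It uses the unrolled form without possessive quantifiers, since those are only available in newer versions of Python's re module. The names QUOTED, samples and contents are made up for the example.

```python
import re

# Unrolled pattern: plain characters, then any number of
# (escaped character + plain characters) repetitions.
# Possessive quantifiers are left out so this runs on older Python 3 too.
QUOTED = re.compile(r'"([^"\\]*(?:\\.[^"\\]*)*)"')

samples = [
    r'asdf"pass\"word"asdf',   # escaped quote inside
    r'"abc\\\"def"',           # literal backslash, then escaped quote
    r'"abcdef\\\\" tail',      # two literal backslashes, closing quote not escaped
]

# Group 1 is the text between the quotes, escape sequences included.
contents = [QUOTED.search(s).group(1) for s in samples]
for c in contents:
    print(c)
```

Each backslash consumes the character after it as a pair, so the closing quote is only accepted when it is not the second half of such a pair.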

Upvotes: 0

CrayonViolent

Reputation: 32537

It's easier to understand if you change the non-capturing group to a capturing group.

Lazy matching generally moves forward one character at a time (vs. greedy matching everything it can and then giving up what it must). But it keeps moving forward as far as needed to satisfy the required parts of the pattern after it. Here that means .*? matches everything up to the r, and the negative lookahead + . matches the d.
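To see which part of the pattern consumes what, here is a small sketch in Python (an assumption; Python's re module handles these particular constructs the same way PCRE does). Both the lazy part and the lookahead group are turned into capturing groups; the variable names are only illustrative.

```python
import re

s = r'asdf"pass\"word"asdf'

# Same pattern as in the question, but with .*? and the
# lookahead group both captured so they can be inspected.
m = re.search(r'"(.*?)((?!\\").)"', s)

whole = m.group(0)
lazy_part = m.group(1)
last_char = m.group(2)

print(whole)      # "pass\"word"
print(lazy_part)  # pass\"wor
print(last_char)  # d
```

The escaped quote ends up inside the lazy group; the lookahead only constrains the single character just before the closing quote.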

Update: you asked in comment:

how come it matches up to the r at all? shouldn't the negative lookahead prevent it from getting passed the \" in the string? thanks for helpin me understand, by the way

No, because it is not the negative lookahead part that matches it. That is why I suggested you change the non-capturing group into a capturing group, so that you can see it is .*? that matches the \", not (?:(?!\\").)

.*? has the potential to match the entire string, and the regex engine uses that to satisfy the requirement to match the rest of the pattern.

Update 2:

It is effectively the same as doing this: ".*?[^\\]" which is probably a lot easier to wrap your head around.

A (slightly) better pattern would use a negative lookbehind, like so: ".*?(?<!\\)". It allows an empty string "" to be matched (a valid match in many contexts). However, negative lookbehinds aren't supported in all engines/languages: judging from your tags, PCRE supports them, but I don't think you can really do this in bash itself, except through something like grep -P '[pattern]', which hands the pattern to a PCRE engine.
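A quick way to compare the three variants is to run them side by side. This is a sketch in Python (an assumption; its re module supports the lookahead and lookbehind used here), and the names lookahead, char_class and lookbehind are just labels for the example.

```python
import re

lookahead = re.compile(r'".*?(?:(?!\\").)"')   # pattern from the question
char_class = re.compile(r'".*?[^\\]"')          # simplified equivalent
lookbehind = re.compile(r'".*?(?<!\\)"')        # negative lookbehind variant

s = r'asdf"pass\"word"asdf'
results = [p.search(s).group(0) for p in (lookahead, char_class, lookbehind)]
for r in results:
    print(r)  # "pass\"word" each time

# An empty quoted string: the first two need at least one character
# between the quotes, the lookbehind variant does not.
empty = 'x""y'
print(lookahead.search(empty))            # None
print(char_class.search(empty))           # None
print(lookbehind.search(empty).group(0))  # ""
```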

Upvotes: 2
