Reputation: 57184
This is one of the toughest things I have ever tried to do. Over the years I have searched but I just can’t find a way to do this — match a string not surrounded by a given char, like quotes or greater/less than symbols.
A regex like this could match URLs not in HTML links, SQL table.column values not in quotes, and lots of other things.
Example with quotes:
Match [THIS] and "something with [NOT THIS] followed by" or even [THIS].
Example with <, >, and ":
Match [URL] and <a href="[NOT URL]">or [NOT URL]</a>
Example with single quotes:
WHERE [THIS] LIKE '%[NOT THIS]'
Basically, how do you match a string (THIS) when it is not surrounded by a given char?
\b(?:[^"'])([^"']+)(?:[^"'])\b
Here is a test pattern: a regex like what I am thinking of would match only the first "quote".
To quote, "quote me not lest I quote you!"
Upvotes: 11
Views: 15247
Reputation: 57184
As Alan M pointed out, you can use a regex lookahead to check whether an even or odd number of quotes remains after a match, which tells you whether you are outside or inside a quoted string. Taking the quotes example, we seem really close to a solution to this problem. The only thing left is to handle escaped quotes. (I'm positive that handling nested quotes is almost impossible.)
$string = 'Match THIS1 and "NOT THIS2" but THIS3 and "NOT "THIS4" or NOT THIS5" but \"THIS6\" is good and \\\\"NOT THIS7\\\\".';
preg_match_all('/[^"]+(?=(?:(?:(?:[^"\\\]++|\\\.)*+"){2})*+(?:[^"\\\]++|\\\.)*+$)/', $string, $matches);
Array (
[0] => Match THIS1 and
[1] => but THIS3 and
[2] => THIS4
[3] => but
[4] => THIS6
[5] => is good and \\
[6] => NOT THIS7\
[7] => .
)
Upvotes: 0
Reputation: 75222
The best solution will depend on what you know about the input. For example, if you're looking for things that aren't enclosed in double-quotes, does that mean double-quotes will always be properly balanced? Can they be escaped with backslashes, or by enclosing them in single-quotes?
Assuming the simplest case--no nesting, no escaping--you could use a lookahead like this:
preg_match('/THIS(?=(?:(?:[^"]*+"){2})*+[^"]*+\z)/', $subject)
After finding the target (THIS), the lookahead basically counts the double-quotes after that point until the end of the string. If there's an odd number of them, the match must have occurred inside a pair of double-quotes, so it's not valid (the lookahead fails).
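A minimal Python sketch of the quote-counting lookahead described above, assuming balanced double quotes with no escaping. The possessive quantifiers are left out because Python's `re` only accepts them from version 3.11; since `[^"]` and `"` cannot match the same character, dropping them should not change what matches. The name `find_unquoted` and the sample text are illustrative assumptions, not part of the original answer.

```python
import re

# After THIS, the rest of the string must contain an even number of
# double quotes: zero or more pairs of `[^"]*"`, then no further quotes.
UNQUOTED_THIS = re.compile(r'THIS(?=(?:(?:[^"]*"){2})*[^"]*\Z)')

def find_unquoted(text):
    """Return the start offsets of every THIS that is outside double quotes."""
    return [m.start() for m in UNQUOTED_THIS.finditer(text)]

sample = 'Match THIS and "something with NOT THIS followed by" or even THIS.'
print(find_unquoted(sample))
```

Only the first and last THIS in the sample are reported; the one between the quotes has a single quote after it, so the lookahead fails there.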
As you've discovered, this problem is not well suited to regular expressions; that's why all of the proposed solutions depend on features that aren't found in real regular expressions, like capturing groups, lookarounds, reluctant and possessive quantifiers. I wouldn't even try this without possessive quantifiers or atomic groups.
EDIT: To expand this solution to account for double-quotes that can be escaped with backslashes, you just need to replace the parts of the regex that match "anything that's not a double-quote":
[^"]
with "anything that's not a quote or a backslash, or a backslash followed by anything":
(?:[^"\\]|\\.)
Since backslash-escape sequences are relatively rare, it's worthwhile to match as many unescaped characters as you can while you're in that part of the regex:
(?:[^"\\]++|\\.)
Putting it all together, the regex becomes:
'/THIS\d+(?=(?:(?:(?:[^"\\]++|\\.)*+"){2})*+(?:[^"\\]++|\\.)*+$)/'
Applied to your test string:
'Match THIS1 and "NOT THIS2" but THIS3 and "NOT "THIS4" or NOT THIS5" ' +
'but \"THIS6\" is good and \\\\"NOT THIS7\\\\".'
...it should match 'THIS1', 'THIS3', 'THIS4' and 'THIS6'.
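To check that claim, here is a hedged Python port of the escape-aware pattern. It assumes the test string contains literal backslash-quote pairs around THIS6 and two literal backslashes before each quote around THIS7 (which is how PHP reads the single-quoted original). The possessive quantifiers are replaced by the plain `(?:[^"\\]|\\.)*` form so the pattern also runs on Python versions before 3.11; `find_outside_quotes` and `TEST` are names made up for this sketch.

```python
import re

# "Anything that's not a quote or a backslash, or a backslash followed by
# anything", repeated; the lookahead then requires an even number of
# unescaped quotes between the match and the end of the string.
CHUNK = r'(?:[^"\\]|\\.)*'
PATTERN = re.compile(r'THIS\d+(?=(?:(?:' + CHUNK + r'"){2})*' + CHUNK + r'$)')

# Raw string, so \" is a backslash followed by a quote and \\ is two backslashes.
TEST = (r'Match THIS1 and "NOT THIS2" but THIS3 and "NOT "THIS4" or NOT THIS5" '
        r'but \"THIS6\" is good and \\"NOT THIS7\\".')

def find_outside_quotes(text):
    """Return every THIS<digits> token that is followed by an even quote count."""
    return PATTERN.findall(text)

print(find_outside_quotes(TEST))
```

THIS4 is reported because the pattern has no notion of nesting: it only counts unescaped quotes to the right of the match.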
Upvotes: 18
Reputation: 297175
It is a bit tough. There are ways, as long as you don't need to keep track of nesting. For instance, let's avoid quoted stuff:
^((?:[^"\\]|\\.|"(?:[^"\\]|\\.)*")*?)THIS
Or, explaining:
^                  Match from the beginning
(                  Store everything from the beginning in group 1, in case I want to do a replace
  (?:              Non-capturing group, just so I can repeat it
    [^"\\]         Anything but a quote or escape character
    |              or...
    \\.            Any escaped character (i.e., \", for example)
    |              or...
    "              A quote, followed by...
    (?:            ...another non-capturing group, of...
      [^"\\]       Anything but a quote or escape character
      |            or...
      \\.          Any escaped character
    )*             ...as many times as possible, followed by...
    "              A (closing) quote
  )*?              As many as necessary, but as few as possible
)                  And this is the end of group 1
THIS               Followed by THIS
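As a rough illustration of how group 1 can be used for a replacement, here is the same pattern in Python. The helper name `replace_first_unquoted` and the sample sentence are assumptions for this sketch; because the pattern is anchored at the start, it only finds the first THIS outside quotes.

```python
import re

# Group 1 lazily consumes plain characters, escaped characters and whole
# quoted strings, so the literal THIS can only match outside quotes.
PREFIX_THEN_THIS = re.compile(r'^((?:[^"\\]|\\.|"(?:[^"\\]|\\.)*")*?)THIS')

def replace_first_unquoted(text, replacement):
    """Replace the first THIS outside double quotes, keeping the prefix."""
    # A callable avoids having to escape backslashes in the replacement text.
    return PREFIX_THEN_THIS.sub(lambda m: m.group(1) + replacement, text, count=1)

sample = 'say "not THIS" then THIS here'
print(replace_first_unquoted(sample, "THAT"))
```

The quoted `"not THIS"` is swallowed as a single unit by the third alternative, which is why the THIS inside it is never a candidate.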
Now, there are other ways of doing this, but, perhaps, not as flexible. For instance, if you want to find THIS, as long as there wasn't a preceding "//" or "#" sequence -- in other words, a THIS outside a comment, you could do it like this:
(?<!(?:#|//).*)THIS
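Here, (?<!...) is a negative look-behind. It won't match these characters, but it will test that they do not appear before THIS. Note that the .* makes the look-behind variable-length, which many engines (PCRE and Python's re among them) reject; .NET accepts it.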
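Python's `re` module rejects variable-width look-behinds, so the sketch below swaps in a different technique with the same intent: anchor at the start of each line and walk forward one character at a time, refusing to step onto a `#` or `//`. It assumes comments run to the end of the line and it reports only the first THIS per line; `find_uncommented` and the sample source are invented for illustration.

```python
import re

# `(?:(?!#|//).)*?` advances one character at a time and stops as soon as a
# comment marker begins, so THIS can only match before any `#` or `//`.
BEFORE_COMMENT = re.compile(r'^((?:(?!#|//).)*?)THIS', re.MULTILINE)

def find_uncommented(text):
    """Return the offset of the first uncommented THIS on each line."""
    return [m.end(1) for m in BEFORE_COMMENT.finditer(text)]

src = "THIS = 1\n# THIS is a comment\nx = 2 // THIS too\ny = THIS\n"
print(find_uncommented(src))
```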
As for any arbitrarily nested structures (n "(" closed by n ")", for example), they can't be represented by regular expressions. Perl can do it, but it's not a regular expression.
Upvotes: 3
Reputation: 57184
After thinking about nested elements ("a "this and "this"") and backslashed items ("\"THIS\""), it seems that it really is true that this isn't a job for regex. The only thing I can think of to solve this problem is a regex-like, char-by-char parser that would set $quote_level = ###; whenever it finds and enters a valid quote or sub-quote. That way, at any point in the string you would know whether you were inside a given delimiter, even if that delimiter is escaped by a backslash or something similar.
I guess with a char-by-char parser like this you could mark the string position of start/end quotes so that you could break up the string by quote segments and only process those outside the quotes.
Here is an example of how this parser would need to be smart enough to handle nested levels.
Match THIS and "NOT THIS" but THIS and "NOT "THIS" or NOT THIS" but \"THIS\" is good.
//Parser "greedy" looking for nested levels
Match THIS and "
NOT THIS"
but THIS and "
NOT "
THIS"
or NOT THIS"
but \"THIS\" is good
//Parser "ungreedy" trying to close nested levels
Match THIS and " " but THIS and " " THIS " " but \"THIS\" is good.
NOT THIS NOT or NOT THIS
//Parser closing levels correctly.
Match THIS and " " but THIS and " " but \"THIS\" is good.
NOT THIS NOT " " or NOT THIS
THIS
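The char-by-char idea described above can be sketched in a few lines of Python. This version is deliberately simpler than the nested example: it assumes flat (non-nested) quotes and backslash escapes, and it records the start and end positions of the segments outside quotes. The function names `split_outside_quotes` and `outside_text` are hypothetical.

```python
def split_outside_quotes(text, quote='"', escape="\\"):
    """Return (start, end) spans of the parts of text outside quoted segments."""
    spans = []
    in_quote = False
    seg_start = 0
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == escape and i + 1 < len(text):
            i += 2  # skip the escape and the character it protects
            continue
        if ch == quote:
            if not in_quote and i > seg_start:
                spans.append((seg_start, i))  # an outside segment ends here
            in_quote = not in_quote
            seg_start = i + 1
        i += 1
    # Trailing text counts only if the last quote was closed.
    if not in_quote and seg_start < len(text):
        spans.append((seg_start, len(text)))
    return spans

def outside_text(text):
    """Return the substrings that lie outside quotes."""
    return [text[a:b] for a, b in split_outside_quotes(text)]

print(outside_text(r'Match THIS and "NOT THIS" but \"THIS\" is good.'))
```

Once the outside segments are isolated, a plain search or a simple regex can be run on each one without worrying about quotes at all.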
Upvotes: 0
Reputation: 51501
Well, regular expressions are just the wrong tool for this, so it is quite natural that it is hard.
Things "surrounded" by other things are not valid rules for regular grammars. Most (one could perhaps say, all serious) markup and programming languages are not regular. As long as there is no nesting involved, you may be able to simulate a parser with a regex, but be sure to understand what you are doing.
For HTML or XML, just use an HTML or XML parser, respectively; those exist for almost any language or web framework, and using them typically involves just a few lines of code. For tables, you might be able to use a CSV parser or, at a pinch, roll your own parser that extracts the parts inside/outside quotes. After extracting the parts you are interested in, you can use simple string comparison or regular expressions to get your results.
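As one possible illustration of the parser-first approach, the sketch below uses Python's standard `html.parser` to keep only the text outside `<a>` elements, then applies a deliberately naive URL regex to what is left. The class name `TextOutsideLinks`, the helper `urls_outside_links`, the URL pattern, and the example.com addresses are all assumptions made for this example.

```python
import re
from html.parser import HTMLParser

class TextOutsideLinks(HTMLParser):
    """Collect text nodes that are not inside an <a> element."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth:
            self.link_depth -= 1

    def handle_data(self, data):
        if self.link_depth == 0:
            self.chunks.append(data)

def urls_outside_links(html):
    """Return URL-like strings found in text outside <a>...</a>."""
    parser = TextOutsideLinks()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return re.findall(r'https?://\S+', "".join(parser.chunks))

page = 'See http://a.example.com and <a href="http://b.example.com">http://c.example.com</a> now'
print(urls_outside_links(page))
```

The parser takes care of attribute quoting and tag boundaries, so the regex only ever sees plain text.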
Upvotes: 0