Reputation: 4048
Disclosure: I have read this answer many times here on SO and I know better than to use regex to parse HTML. This question is just to broaden my knowledge with regex.
Say I have this string:
some text <tag link="fo>o"> other text
I want to match the whole tag, but if I use <[^>]+> it only matches <tag link="fo>.
How can I make sure that a > inside quotes is ignored?
I can trivially write a parser with a while loop to do this, but I want to know how to do it with regex.
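For anyone who wants to reproduce the problem, here is a minimal sketch in JavaScript (the variable names are only for illustration). The negated character class stops at the first >, even though that > sits inside the attribute value:

```javascript
// The sample string from the question.
const input = 'some text <tag link="fo>o"> other text';

// Naive pattern: "<", then one or more characters that are not ">", then ">".
const naive = /<[^>]+>/;

const found = input.match(naive)[0];
console.log(found); // <tag link="fo>
```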
Upvotes: 12
Views: 7447
Reputation: 974
If you want this to work with escaped double quotes, try:
/>(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g
For example:
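const gtExp = />(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g;
// Index of the next > outside double quotes in xml (the string being parsed), or -1 if there is none.
const nextGtMatch = () => {
  const match = gtExp.exec(xml);
  return match ? match.index : -1;
};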
And if you're parsing through a bunch of XML, you'll want to set .lastIndex.
gtExp.lastIndex = xmlIndex;
const attrEndIndex = nextGtMatch(); // the end of the tag's attributes
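To see the pieces working together, here is a self-contained sketch run against the string from the question. The helper name findTagEnd and the sample strings are my own assumptions, not part of the snippet above. The lookahead only accepts a > that is followed by an even number of unescaped double quotes up to the end of the input, so it assumes the quotes in the rest of the string are balanced, and it only knows about double quotes.

```javascript
const gtExp = />(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g;

// Hypothetical helper: index of the first ">" at or after fromIndex that is
// not inside a double-quoted value, or -1 if there is none.
function findTagEnd(xml, fromIndex) {
  gtExp.lastIndex = fromIndex; // start the search at the current position
  const match = gtExp.exec(xml);
  return match ? match.index : -1;
}

const sample = 'some text <tag link="fo>o"> other text';
const tagStart = sample.indexOf('<');
const tagEnd = findTagEnd(sample, tagStart);
console.log(sample.slice(tagStart, tagEnd + 1)); // <tag link="fo>o">

// An escaped quote inside the value does not end the quoted section.
const escaped = '<tag link="a\\">b"> tail';
const escapedEnd = findTagEnd(escaped, 0);
console.log(escaped.slice(0, escapedEnd + 1)); // <tag link="a\">b">
```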
Upvotes: 0
Reputation: 12158
(<.+?>[^<]+>)|(<.+?>)
You can write two regexes and then combine them with '|'. In this case:
(<.+?>[^<]+>) #will match some text <tag link="fo>o"> other text
(<.+?>) #will match some text <tag link="foo"> other text
If the first alternative matches, the second one is not tried, so make sure you put the special case first.
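A quick sketch of how the alternation behaves in JavaScript (the variable names and the third sample string are assumptions for illustration). Note that the first alternative accepts any later > that comes before the next <, so a stray > in the text after an ordinary tag gets swallowed too:

```javascript
const tagExp = /(<.+?>[^<]+>)|(<.+?>)/;

const withGt = 'some text <tag link="fo>o"> other text';
const plain = 'some text <tag link="foo"> other text';
const stray = '<b> 1 > 0';

// First alternative matches: the tag contains an extra ">".
const withGtMatch = withGt.match(tagExp);
console.log(withGtMatch[0]); // <tag link="fo>o">

// First alternative fails, so the second one is used (group 2 is set).
const plainMatch = plain.match(tagExp);
console.log(plainMatch[0]); // <tag link="foo">

// Caveat: the ">" in the text after the tag is treated as part of the tag.
const strayMatch = stray.match(tagExp);
console.log(strayMatch[0]); // <b> 1 >
```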
Upvotes: 0
Reputation: 9591
<[^>]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>
I know this regex might be a headache to look at, so here is my explanation:
< # Open HTML tags
[^>]*? # Lazy Negated character class for closing HTML tag
(?: # Open Outside Non-Capture group
(?: # Open Inside Non-Capture group
('|") # Capture group for quotes, backreference group 1
[^'"]*? # Lazy Negated character class for quotes
\1 # Backreference 1
) # Close Inside Non-Capture group
[^>]*? # Lazy Negated character class for closing HTML tag
)* # Close Outside Non-Capture group
> # Close HTML tags
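As a rough check, this is how the pattern behaves on the string from the question in JavaScript (variable names are only for illustration). Group 1 holds the quote character of the last quoted section that was matched:

```javascript
const tagExp = /<[^>]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>/;

const input = 'some text <tag link="fo>o"> other text';
const match = input.match(tagExp);

console.log(match[0]); // <tag link="fo>o">
console.log(match[1]); // "
```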
Upvotes: 16
Reputation: 1392
This is a slight improvement on Vasili Syrakis's answer. It handles "…" and '…' completely separately, and does not use the *? quantifier.
<[^'">]*(("[^"]*"|'[^']*')[^'">]*)*>
< # start of HTML tag
[^'">]* # any characters other than ', " or >
( # outer group
( # inner group
"[^"]*" # "..."
| # or
'[^']*' # '...'
) # close inner group
[^'">]* # any characters other than ', " or >
)* # zero or more of outer group
> # end of HTML tag
This version is slightly better than Vasili's in that single quotes are allowed inside "…", double quotes are allowed inside '…', and an (incorrect) tag like <a href='> will not be matched.
It is slightly worse than Vasili's solution in that the groups are captured. If you do not want that, replace ( with (?: in all places. (Just using ( makes the regex shorter and a little more readable.)
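Here is a small side-by-side sketch of the differences described above, in JavaScript. The names lazyExp, strictExp and firstMatch are assumptions for illustration, and strictExp is the non-capturing variant with ( replaced by (?: as suggested:

```javascript
// Vasili's pattern (lazy quantifiers, backreference for the quote).
const lazyExp = /<[^>]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>/;
// This answer's pattern, with the groups made non-capturing.
const strictExp = /<[^'">]*(?:(?:"[^"]*"|'[^']*')[^'">]*)*>/;

// Hypothetical helper: the matched text, or null when there is no match.
const firstMatch = (re, text) => {
  const m = text.match(re);
  return m ? m[0] : null;
};

const sample = 'some text <tag link="fo>o"> other text';
const mixed = `<a title="it's > ok">`; // single quote inside "…"
const broken = `<a href='>`;           // unterminated quote

console.log(firstMatch(lazyExp, sample));   // <tag link="fo>o">
console.log(firstMatch(strictExp, sample)); // <tag link="fo>o">

console.log(firstMatch(lazyExp, mixed));    // <a title="it's >
console.log(firstMatch(strictExp, mixed));  // <a title="it's > ok">

console.log(firstMatch(lazyExp, broken));   // <a href='>
console.log(firstMatch(strictExp, broken)); // null
```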
Upvotes: 1