Reputation: 4151
I'm trying to match any bracketed items within <sup> tags.
My regular expression is being too greedy, starting with the first <sup> tag and ending at the last </sup> tag.
/<sup\b[^>]*>(.*?)\[(.*?)\](.*?)<\/sup>/
Example html:
<sup>[this should be gone]</sup>
<sup>but this should stay</sup>
<sup>this should [ also stay</sup>
[and this as well]
<sup><a href="#">[but this should definitely go]</a></sup>
Any idea why?
Thanks!
EDIT: I suppose these answers make sense. I've got much of the HTML parsed without regex; I just figured that this particular example would work with regex because it would do the following:
<sup>
tag</sup>
Upvotes: 0
Views: 203
Reputation: 4564
Using regexp to parse HTML is usually not a very good idea.
See Parsing Html The Cthulhu Way.
Upvotes: 0
Reputation: 19251
You probably cannot do this with one regular expression. You will need one that replaces using a callback function, which will run a separate regular expression.
The better method, as everyone has mentioned, would be to use a DOM object to parse the HTML first.
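For what it's worth, the callback idea can be sketched roughly like this. It is in Python rather than PHP, purely for illustration; the names SUP_RE, BRACKET_RE and strip_bracketed_sups are made up for this sketch. It assumes <sup> tags are never nested inside each other, and that "gone" means dropping the whole <sup> element, which the question does not spell out.

```python
import re

# Outer pattern: one <sup>...</sup> element.
# Assumption: <sup> elements are never nested inside each other.
SUP_RE = re.compile(r"<sup\b[^>]*>(.*?)</sup>", re.DOTALL)

# Inner pattern: a bracketed item with no square brackets inside it.
BRACKET_RE = re.compile(r"\[[^\[\]]*\]")


def strip_bracketed_sups(html):
    """Drop every <sup> element whose content contains a bracketed item."""

    def replace(match):
        inner = match.group(1)
        # Second, separate regex runs only on the content of this one element.
        return "" if BRACKET_RE.search(inner) else match.group(0)

    return SUP_RE.sub(replace, html)
```

Because the inner regex only ever sees the content of a single element, an unmatched "[" in one <sup> cannot pair up with a "]" in a later one. It is still a regex over HTML, so it will break on nested <sup> tags or unusual markup.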
Upvotes: 0
Reputation: 23255
Isn't it the normal behavior? Have you specified the ungreedy option for your regexp?
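A quick check (in Python here, just for illustration; the test string is made up from the question's examples) suggests the pattern's quantifiers are already lazy, so an ungreedy flag alone may not change much. A lazy (.*?) still starts at the first <sup> and keeps expanding, across </sup> boundaries, until it finds a "[":

```python
import re

# The pattern from the question; every quantifier is already lazy.
PATTERN = re.compile(r"<sup\b[^>]*>(.*?)\[(.*?)\](.*?)</sup>")

# Two <sup> elements on one line; only the second contains a bracketed item.
html = "<sup>but this should stay</sup> <sup>[this should be gone]</sup>"

match = PATTERN.search(html)
matched_text = match.group(0)   # the whole match
before_bracket = match.group(1) # what the first lazy group had to swallow
```

Here matched_text is the entire input, and before_bracket contains the first element's text plus the closing and opening tags between the two elements. Lazy only means "as short as possible from where the match started", not "stay inside one tag".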
Upvotes: 0
Reputation: 36622
You really can't do this. It's impossible to parse HTML with regular expressions, because regular expressions can only match regular languages; these languages are a simpler subset of the actual languages we use. One very common non-regular language is the Dyck language of balanced brackets; it's impossible to match correctly nested parentheses with regular expressions. And HTML, if you think about it, is the same as this, with tags replacing parentheses. Thus, (a) matching correctly nested sup tags is impossible, and (b) matching balanced brackets is impossible. I don't use PHP myself, but I know it has access to an HTML DOM; I'd recommend using that instead. Then, filter through that for every sup tag, and check each one's inner text. If you only want to catch tags whose inner text is just [...], where the ... does not contain square brackets, you can use ^\[[^\]]+\]$ as your regex; if you want real nesting, more complicated checking is necessary.
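A minimal sketch of that approach, in Python rather than PHP (which I don't use) and with the standard library's event-based html.parser standing in for a real DOM. The names SupTextCollector and bracketed_sup_texts are made up for this sketch. It uses the regex above as written; note that it only forbids "]" inside the brackets, so a stricter variant would be needed to reject an inner "[" as well.

```python
import re
from html.parser import HTMLParser

# Regex from above: the whole inner text is a single bracketed item.
BRACKETED = re.compile(r"^\[[^\]]+\]$")


class SupTextCollector(HTMLParser):
    """Collect the inner text of each outermost <sup> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <sup> elements currently open
        self.buffer = []  # text pieces seen inside the open <sup>
        self.texts = []   # inner text of each completed outermost <sup>

    def handle_starttag(self, tag, attrs):
        if tag == "sup":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "sup" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.texts.append("".join(self.buffer))
                self.buffer = []

    def handle_data(self, data):
        # Text inside nested tags such as <a> is still part of the inner text.
        if self.depth > 0:
            self.buffer.append(data)


def bracketed_sup_texts(html):
    """Return the inner texts of <sup> elements that are just [...]."""
    parser = SupTextCollector()
    parser.feed(html)
    parser.close()
    return [t for t in parser.texts if BRACKETED.match(t.strip())]
```

On the question's example HTML this should pick out the first and last <sup> elements, including the one whose text sits inside an <a>, and skip the two that "should stay". Actually removing those elements from the document is left out; with a real DOM you would delete the matching nodes.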
Upvotes: 2
Reputation: 17314
If your requirement was to specifically remove any text inside "<sup>[" and "]</sup>", then you would be ok. But by your last example, you want to account for a nested tag as well, and probably arbitrary nested tags. So therefore I must remind you...
Upvotes: 0