Reputation: 1531
To change tag pairs around text, this Postgres SELECT expression works for me:
select regexp_replace('The corpse of the huge <i>fin whale</i> created a spectacle on <span class="day">Friday</span> as <i>people</i> wandered the beach to observe it.',
'(<i>)([^/]+)(</i>)',
'<em>\2</em>',
'g');
I worry about excessive greediness in capture group two, though. My first try for group two was (.+), and that was a failure. The ([^/]+) works better, but I wonder if it is good enough.
Can anything be done to make that SELECT statement more robust?
Upvotes: 2
Views: 2540
Reputation: 44259
There are generally two possibilities (and both seem to be supported by PostgreSQL's regex engine).
Make the repetition ungreedy:
<i>(.+?)</i>
Use a negative lookahead to ensure that you consume anything except for </i>:
<i>((?:(?!</i>).)+)</i>
In both cases, I removed the unnecessary captures. You can now use \1 in your replacement string.
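Plugged back into the SELECT from the question, the two patterns could look like the sketch below. It assumes the default standard_conforming_strings = on, so the backslash in \1 needs no doubling, and the shortened sample string is made up for illustration.

```sql
-- Option 1: non-greedy repetition, stops at the first </i> it can reach.
select regexp_replace(
  'The huge <i>fin whale</i> drew <i>people</i> to the beach.',
  '<i>(.+?)</i>',
  '<em>\1</em>',
  'g');

-- Option 2: negative lookahead, no consumed character may start a </i>.
select regexp_replace(
  'The huge <i>fin whale</i> drew <i>people</i> to the beach.',
  '<i>((?:(?!</i>).)+)</i>',
  '<em>\1</em>',
  'g');
```

Both queries are expected to turn each <i>...</i> pair into its own <em>...</em> pair, without merging the two pairs into one long match.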
These two should be equivalent in what they do. Their performance might vary, though. The former needs backtracking, while the latter has to attempt the lookahead at every single position. Which one is faster would have to be profiled and might even depend on individual input strings. Note that, since the second pattern uses a greedy repetition guarded by the lookahead, the group would still capture the same text if you removed the trailing </i> (as long as a closing tag is present). The match would then no longer consume that closing tag, though, so keep it when replacing.
The approach you have is already robust in the sense that you can never go past a </i>. But at the same time, your approach does not allow nested tags (because the repetition cannot go past the closing tag of the nested pair).
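The nested-tag limitation is easy to reproduce. In this sketch, the sample string with a <b> pair inside the <i> pair is made up for illustration:

```sql
-- Original pattern: [^/]+ cannot cross the slash in </b>, so nothing matches
-- and the input comes back unchanged.
select regexp_replace('<i>fin <b>whale</b></i>',
  '(<i>)([^/]+)(</i>)', '<em>\2</em>', 'g');

-- Non-greedy pattern: . may match the slash, so the whole pair is rewritten
-- to <em>fin <b>whale</b></em>.
select regexp_replace('<i>fin <b>whale</b></i>',
  '<i>(.+?)</i>', '<em>\1</em>', 'g');
```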
However, you should note that regular expressions are not really up to the job of parsing/manipulating HTML. What if there are extraneous spaces in your tags? Or what if the opening tag has attributes? Or what if one or both of the tags occur in attribute names or comments?
Upvotes: 6