Reputation: 520
My experience with regex is only a little beyond introductory, so this is a challenge. Perhaps someone with a math or physics background can figure it out...
We have to wrap certain words/phrases with a <span class="tooltip"></span> so that a relevant tooltip is displayed for the contents of the span. The challenge is how to avoid wrapping a word a second time if it is part of another phrase that was already wrapped.
The example: "Use Twitter Analyzer for analytics".
Both Twitter and Twitter Analyzer have tooltips, but only the Twitter Analyzer needs to be wrapped in the above. This is achieved by ensuring we search for the longest phrases first.
How do you prevent (using only Regular Expressions) the shorter phrase of the two from being wrapped again if it is already wrapped in another span?
Furthermore, Twitter and Twitter Analyzer are only two examples from an entire list, so the solution needs to be generic.
Any ideas?
Upvotes: 4
Views: 285
Reputation: 4144
I think your best bet is to match individual phrases you are looking for, and for each hit, save the string offset for the beginning of the match. Once you have built your list of offsets, sort the offsets from lowest to highest. For each offset in the list, compute the end offset of the string by adding the string length. If any of the later items in the list have an offset less than this new offset, remove them. If two offsets in the list are the same, take the longer of the two strings and throw the other out.
In your given example, the offset would be 4 for "Twitter Analyzer" and 4 for "Twitter". For the sake of demonstration, say you were also interested in "Analyzer", which has an offset of 12. The sorted list would be:
offset 4 - Twitter Analyzer - length 16
offset 4 - Twitter - length 7
offset 12 - Analyzer - length 8
Since there are two 4's, throw out the one with the shorter length. Then add the length of "Twitter Analyzer" to its offset to get 20. Any remaining offsets less than 20 but greater than 4 get thrown out, which removes "Analyzer" at offset 12.
To insert the tags, retain your list of start and end offsets and start at the end of the list. At each end offset insert a "</span>" and at each begin offset insert a "<span class="tooltip">". Move backward through the string until you reach the front. This lets you make the substitutions without having to recalculate offsets.
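The steps above could be sketched roughly as follows. This is only a minimal illustration, not a tested production routine: the function name wrapTooltips is hypothetical, the \b word boundaries are an added assumption, and the offsets are byte offsets, so multibyte text would need extra care.

```php
<?php
// Hypothetical helper: wrap each phrase in a tooltip span, preferring the
// earliest and longest match and discarding any match that overlaps it.
function wrapTooltips(string $input, array $phrases): string
{
    // 1. Collect [start, length] for every occurrence of every phrase.
    $hits = [];
    foreach ($phrases as $phrase) {
        // Assumption: phrases should only match on word boundaries.
        $pattern = '/\b' . preg_quote($phrase, '/') . '\b/';
        preg_match_all($pattern, $input, $m, PREG_OFFSET_CAPTURE);
        foreach ($m[0] as [$text, $start]) {
            $hits[] = [$start, strlen($text)];
        }
    }

    // 2. Sort by start offset; on ties, put the longer match first.
    usort($hits, fn($a, $b) => [$a[0], $b[1]] <=> [$b[0], $a[1]]);

    // 3. Keep a hit only if it starts at or after the end of the last kept hit.
    $kept = [];
    $end = 0;
    foreach ($hits as [$start, $length]) {
        if ($start >= $end) {
            $kept[] = [$start, $length];
            $end = $start + $length;
        }
    }

    // 4. Insert the tags from the back so earlier offsets stay valid.
    foreach (array_reverse($kept) as [$start, $length]) {
        $wrapped = '<span class="tooltip">' . substr($input, $start, $length) . '</span>';
        $input = substr_replace($input, $wrapped, $start, $length);
    }
    return $input;
}

echo wrapTooltips(
    'Use Twitter Analyzer for analytics',
    ['Twitter', 'Twitter Analyzer', 'Analyzer']
), PHP_EOL;
// prints: Use <span class="tooltip">Twitter Analyzer</span> for analytics
```

Because overlaps are resolved on offsets before any tag is inserted, the order of the phrase list does not matter here.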
Upvotes: 2
Reputation: 520
Michaelc gave a good suggestion to use negative lookahead. What about negative lookbehind?
You should then get away with:
$match = '/(?<!\<span class="tooltip">)Twitter/';
$replace = '<span class="tooltip">\0</span>';
$output = preg_replace($match, $replace, $input);
We wouldn't need to maintain the match list and could build each match item as we go through the word/phrase list. The downside is, as eyelidlessness said, that you will have a problem with overlaps like "Foo Bar" and "Bar Baz". Yet you could inspect the matches found to check that they don't contain a <span class="tooltip"> or a </span>. Not 100% accurate though.
Comments?
Upvotes: 0
Reputation: 114004
And now for the obligatory "you can't parse HTML with regex" link: RegEx match open tags except XHTML self-contained tags
Upvotes: 2
Reputation: 1821
If you can store the list to be matched in regex form, you could use negative lookahead to ensure each match is distinct. You would need access to PCRE functions. For example:
$match = array('/Twitter(?! Analyzer)/', '/Twitter Analyzer/');
$replace = '<span class="tooltip">\0</span>';
$output = preg_replace($match, $replace, $input);
I probably don't need to mention that this will make maintaining your match list more difficult.
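One way to ease that maintenance burden might be to generate the lookaheads from a plain phrase list. The sketch below is only an illustration: buildPatterns is a hypothetical helper, and it only covers the case where a shorter phrase is a prefix of a longer one (a phrase such as "Analyzer", which sits at the end of "Twitter Analyzer", would still be wrapped twice).

```php
<?php
// Hypothetical helper: for each phrase, add a negative lookahead for every
// longer phrase in the list that starts with it.
function buildPatterns(array $phrases): array
{
    $patterns = [];
    foreach ($phrases as $phrase) {
        $lookaheads = '';
        foreach ($phrases as $other) {
            if ($other !== $phrase && strpos($other, $phrase) === 0) {
                // The remainder of the longer phrase must NOT follow.
                $rest = substr($other, strlen($phrase));
                $lookaheads .= '(?!' . preg_quote($rest, '/') . ')';
            }
        }
        $patterns[] = '/' . preg_quote($phrase, '/') . $lookaheads . '/';
    }
    return $patterns;
}

$patterns = buildPatterns(['Twitter', 'Twitter Analyzer']);
// $patterns is now ['/Twitter(?! Analyzer)/', '/Twitter Analyzer/']

$replace = '<span class="tooltip">\0</span>';
echo preg_replace($patterns, $replace, 'Use Twitter Analyzer for analytics'), PHP_EOL;
// prints: Use <span class="tooltip">Twitter Analyzer</span> for analytics
```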
Upvotes: 1
Reputation: 272687
You cannot do this using only regex. Regular expressions cannot match an arbitrary number of balanced opening and closing tags, because such nesting does not form a regular language. You will need to keep the count yourself.
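As a rough sketch of what keeping the count yourself could look like, the code below splits the input into tags and text and tracks which spans are open, so a phrase is only wrapped when no tooltip span encloses it. The function name wrapOutsideSpans is hypothetical, and the code assumes well-formed markup, no ">" inside attribute values, and the exact attribute text class="tooltip". It is not a substitute for a real HTML parser.

```php
<?php
// Hypothetical helper: wrap $phrase only where the text is not already
// inside a <span class="tooltip"> element.
function wrapOutsideSpans(string $input, string $phrase): string
{
    // Split into text and tags. With PREG_SPLIT_DELIM_CAPTURE the pieces
    // alternate: even indexes are text, odd indexes are tags.
    $parts = preg_split('/(<[^>]+>)/', $input, -1, PREG_SPLIT_DELIM_CAPTURE);
    $pattern = '/\b' . preg_quote($phrase, '/') . '\b/';

    $stack = []; // one entry per open <span>: true if it is a tooltip span
    $out = '';
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            // A tag: update the stack of open spans.
            if (preg_match('/^<span\b/i', $part)) {
                $stack[] = (strpos($part, 'class="tooltip"') !== false);
            } elseif (preg_match('/^<\/span\s*>/i', $part)) {
                array_pop($stack);
            }
        } elseif (!in_array(true, $stack, true)) {
            // Text outside any tooltip span: safe to wrap.
            $part = preg_replace($pattern, '<span class="tooltip">\0</span>', $part);
        }
        $out .= $part;
    }
    return $out;
}

$input = 'Use <span class="tooltip">Twitter Analyzer</span> with Twitter';
echo wrapOutsideSpans($input, 'Twitter'), PHP_EOL;
// prints: Use <span class="tooltip">Twitter Analyzer</span> with <span class="tooltip">Twitter</span>
```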
Upvotes: 1