Reputation: 460
I know this has been done to death already. I've found lots of topics on the subject already and have taken lots of advice. However if I have the following string:
@testaccount
<a href="http://twitter.com/testaccount">@testaccount</a>
Obviously, I don't want to convert the second one to a link as it already is one. I've managed to find the first one without it being an email (thanks to several questions already here).
Here is the pattern I've got already:
/(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9_]+)/
That will convert the first one perfectly, but the second one will obviously become a 'double link'.
So I worked out that I should use something like (?!<\/a>). However, that only removes the last t of testaccount.
Essentially, I need to find a way to ignore the whole match rather than just remove one character. Is this possible?
Language I'm using is PHP.
Thanks
Upvotes: 3
Views: 53
Reputation: 324640
Regex, bad. Parsing, good.
$dom = new DOMDocument();
$dom->loadHTML("<div>".$your_html_source_here."</div>",
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
// select text nodes that contain an @ and are not inside an <a>
$nodes = $xpath->query("//text()[contains(.,'@')][not(ancestor::a)]");
foreach($nodes as $node) {
    // each of these nodes contains at least one @ to be processed
    // note that text inside <a> tags is already excluded by the query
    $value = $node->nodeValue;
    preg_match_all("/(?:^|(?<=\s))@\w+/",$value,$matches,
        PREG_PATTERN_ORDER|PREG_OFFSET_CAPTURE);
    // work backwards - earlier offsets stay valid while the node is split
    foreach(array_reverse($matches[0]) as $match) {
        list($text,$byteOffset) = $match;
        // PREG_OFFSET_CAPTURE reports byte offsets, but splitText()
        // expects a character offset, so convert before splitting
        $offset = mb_strlen(substr($value,0,$byteOffset),'UTF-8');
        $node->splitText($offset+mb_strlen($text));
        $middle = $node->splitText($offset);
        // now wrap the text in a link:
        $link = $dom->createElement('a');
        $link->setAttribute("href","http://twitter.com/".substr($text,1));
        $node->parentNode->insertBefore($link,$middle);
        $link->appendChild($middle);
    }
}
// output: strip the wrapper <div> again
$result = substr(trim($dom->saveHTML()),strlen("<div>"),-strlen("</div>"));
(Note: The addition of <div> around the content is to ensure that there is a root element - otherwise parsing will encounter problems.)
Upvotes: 0
Reputation: 174706
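You need to add .*? before <\/a> inside that negative lookahead, so that it won't match @ strings which are already wrapped in an anchor tag. Without it, the engine simply backtracks one character: the text after the shortened match is then t</a> rather than </a>, so the lookahead passes.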
(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z0-9_]+)(?!.*?<\/a>)
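To make the difference concrete, here is a small sketch that runs both lookahead variants against the sample from the question. The helper name findMentions and the two test strings are made up for illustration, and the character class is the one from the pattern above.

```php
<?php
// Pattern with the plain lookahead the question tried.
$naive = '/(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z0-9_]+)(?!<\/a>)/';
// Same pattern with .*? added inside the lookahead.
$fixed = '/(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z0-9_]+)(?!.*?<\/a>)/';

// Sample input: once with the link on its own line, once on the same line.
$twoLines = "@testaccount\n" . '<a href="http://twitter.com/testaccount">@testaccount</a>';
$oneLine  = '@testaccount <a href="http://twitter.com/testaccount">@testaccount</a>';

// Hypothetical helper: return every full match of $pattern in $subject.
function findMentions(string $pattern, string $subject): array
{
    preg_match_all($pattern, $subject, $m);
    return $m[0];
}

print_r(findMentions($naive, $twoLines)); // @testaccount, @testaccoun
print_r(findMentions($fixed, $twoLines)); // @testaccount
print_r(findMentions($fixed, $oneLine));  // no matches at all
```

The last call shows a limitation worth knowing about: because . does not cross newlines by default, the lookahead rejects any mention that has a </a> later on the same line, even when the mention itself is not inside a link.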
Upvotes: 1
Reputation: 70732
You could make effective use of (*SKIP) and (*FAIL) backtracking control verbs.
~<a[^<]*</a>(*SKIP)(*F)|@(\w+)~
The idea is to skip any content that is located between <a ...> and </a> tags. On the left side of the alternation operator we match the subpattern we do not want, making it fail and forcing the regex engine to not retry the substring.
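As a rough sketch of how this pattern could be plugged into preg_replace, assuming the same http://twitter.com/ link format used in the question (the function name linkMentions is made up for this example):

```php
<?php
// Hypothetical helper: wrap bare @mentions in links, leaving existing <a>...</a> alone.
function linkMentions(string $html): string
{
    return preg_replace(
        // Left branch consumes a whole <a ...>...</a> and discards it;
        // right branch captures the account name after the @.
        '~<a[^<]*</a>(*SKIP)(*F)|@(\w+)~',
        '<a href="http://twitter.com/$1">@$1</a>',
        $html
    );
}

$input = "@testaccount\n" . '<a href="http://twitter.com/testaccount">@testaccount</a>';
echo linkMentions($input);
// Both lines now read:
// <a href="http://twitter.com/testaccount">@testaccount</a>
```

Note that @(\w+) on its own has no guard against email addresses, so the lookbehind from the question would need to be put back in front of the @ if such text can occur in the input.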
Upvotes: 2