Reputation: 460

Twitter regex only when not already a link

I know this has been done to death. I've found lots of topics on the subject and have taken lots of advice. However, suppose I have the following string:

@testaccount
<a href="http://twitter.com/testaccount">@testaccount</a>

Obviously, I don't want to convert the second one to a link, as it already is one. I've managed to match the first one without also matching email addresses (thanks to several questions already here).

Here is the pattern I've got already:

/(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9_]+)/

That will convert the first one perfectly, but the second one will obviously become a 'double link'.

So I worked out that I should use a negative lookahead, something like (?!<\/a>). However, that only drops the last t of testaccount from the match.

Essentially, I need to find a way to ignore the whole match rather than just remove one character. Is this possible?
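
The one-character behaviour comes from backtracking: when the lookahead fails directly after the full name, the engine gives back the final character and retries at a position where </a> no longer follows. A minimal sketch reproducing this, using the pattern above with (?!<\/a>) appended (the variable names are only illustrative):

```php
<?php
// Pattern from the question with the negative lookahead appended.
$pattern = '/(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9_]+)(?!<\/a>)/';
$linked  = '<a href="http://twitter.com/testaccount">@testaccount</a>';

// The lookahead rejects the position directly before </a>, so the engine
// backtracks by one character and settles for a shorter name instead.
preg_match($pattern, $linked, $m);
echo $m[1], "\n"; // testaccoun
```

So the lookahead does not reject the mention as a whole; it only rejects one end position for it.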

The language I'm using is PHP.

Thanks

Upvotes: 3

Answers (3)

Niet the Dark Absol

Reputation: 324640

Regex, bad. Parsing, good.

$dom = new DOMDocument();
$dom->loadHTML("<div>".$your_html_source_here."</div>",
                                      LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//text()[contains(.,'@')][not(ancestor::a)]");
foreach($nodes as $node) {
    // each of these nodes contains at least one @ to be processed
    // note that children of <a> tags are automatically ignored
    preg_match_all("/(?:^|(?<=\s))@\w+/",$node->nodeValue,$matches,
                                           PREG_PATTERN_ORDER|PREG_OFFSET_CAPTURE);
    // work backwards - it's easier
    foreach(array_reverse($matches[0]) as $match) {
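        list($text,$offset) = $match;
        // PREG_OFFSET_CAPTURE gives byte offsets; splitText() expects character offsets
        $start = mb_strlen(substr($node->nodeValue,0,$offset),'UTF-8');
        $node->splitText($start+mb_strlen($text,'UTF-8'));
        $middle = $node->splitText($start);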
        // now wrap the text in a link:
        $link = $dom->createElement('a');
        $link->setAttribute("href","http://twitter.com/".substr($text,1));
        $node->parentNode->insertBefore($link,$middle);
        $link->appendChild($middle);
    }
}
// output
$result = substr(trim($dom->saveHTML()),strlen("<div>"),-strlen("</div>"));

(Note: The addition of <div> around the content is to ensure that there is a root element - otherwise parsing will encounter problems.)
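
To see which text nodes the XPath query picks up, here is a small sketch that runs only the selection step on the string from the question. The sample input and the $found array are illustrative assumptions, not part of the code above:

```php
<?php
// Sample input: one bare mention and one that is already a link.
$html = '@testaccount <a href="http://twitter.com/testaccount">@testaccount</a>';

$dom = new DOMDocument();
$dom->loadHTML("<div>".$html."</div>",
               LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);

// Collect every text node that contains an @ and is not inside an <a>.
$found = [];
foreach ($xpath->query("//text()[contains(.,'@')][not(ancestor::a)]") as $node) {
    $found[] = trim($node->nodeValue);
}
echo implode(',', $found), "\n"; // @testaccount
```

Only the bare mention is returned; the text inside the existing link never reaches the regex.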

Upvotes: 0

Avinash Raj

Reputation: 174706

You need to add .*? before <\/a> inside that negative lookahead, so that it won't match @ strings that are already inside a link.

(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z0-9_]+)(?!.*?<\/a>)
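
As a rough sketch, this is how the pattern might be used with preg_replace on the string from the question. The replacement template and the variable names are assumptions made for illustration:

```php
<?php
$pattern  = '/(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z0-9_]+)(?!.*?<\/a>)/';
$template = '<a href="http://twitter.com/$1">@$1</a>';

// The two lines from the question: a bare mention, then an existing link.
$input  = "@testaccount\n"
        . '<a href="http://twitter.com/testaccount">@testaccount</a>';
$output = preg_replace($pattern, $template, $input);
echo $output, "\n";
// Only the first line is wrapped; the existing link is left untouched.

// Caveat: without the s modifier, . stops at a newline, so the lookahead
// rejects any mention that has a </a> later on the same line.
$sameLine = '@alice and <a href="http://twitter.com/bob">@bob</a>';
$unchanged = preg_replace($pattern, $template, $sameLine);
echo $unchanged, "\n"; // @alice is not converted either
```

The second case shows a limitation worth knowing about: a bare mention that sits before a link on the same line is skipped too.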

Upvotes: 1

hwnd

Reputation: 70732

You could make effective use of the (*SKIP) and (*FAIL) backtracking control verbs.

~<a[^<]*</a>(*SKIP)(*F)|@(\w+)~

The idea is to skip any content located between <a ...> and </a> tags. On the left side of the alternation operator we match the subpattern we do not want, then make it fail and force the regex engine not to retry that substring.
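
A minimal sketch of the pattern in use, applied to the string from the question. The replacement template and the variable names are assumptions for illustration:

```php
<?php
$pattern  = '~<a[^<]*</a>(*SKIP)(*F)|@(\w+)~';
$template = '<a href="http://twitter.com/$1">@$1</a>';

// The two lines from the question: a bare mention, then an existing link.
$input  = "@testaccount\n"
        . '<a href="http://twitter.com/testaccount">@testaccount</a>';

// The existing <a ...>...</a> is consumed by the left alternative and
// discarded, so only the bare mention reaches the @(\w+) alternative.
$output = preg_replace($pattern, $template, $input);
echo $output, "\n";
```

Both lines end up as the same single link. Note that [^<]* stops at the first <, so a link whose text contains nested tags would not be skipped by this pattern as written.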

Upvotes: 2
