pixeline

Reputation: 17974

XPath: exclude text that is part of an attribute value

I need to parse a chunk of HTML looking for a specific term, and wrap every instance of that term in an A tag (with class "tag").

To do that, I parse the HTML with XPath, and it works well...

$nodes = $xpath->query('//text()[contains(., "CLEA")]');

except in some rare cases where the term is inside an attribute value. In those cases the term gets wrapped again inside the attribute, and the HTML breaks:

Hello <a class="tag" title="this is <a class="tag" href="#">CLEA</a>">CLEA</a>, hello!

What I want is:

Hello <a class="tag" title="this is CLEA">CLEA</a>, hello!

I'm struggling to correct my XPath query so that it excludes text that is part of an attribute value.

Your help would be much appreciated, thank you.

Here is a sample of the HTML that is being parsed using XPath:

<?xml version="1.0" encoding="UTF-8"?>
<p>
Carte Blanche aux Artistes du <a class="tag" href="?tag=clea" rel="tag-definition" title="Click here to learn more about CLEA">CLEA</a>
14.01 - 19.01.2013
at: 
Gare Numérique de Jeumont, France
Organised by:
DRAC, Nord-Pas de Calais
Education National Nord-Pas de Calais
In the context of :
CLEA, résidence-mission
Contrat Local d'Education Artistique
http://cleavaldesambre.wordpress.com/
With: Martin Mey, Stephane Querrec, Woudi Tat, Marie Morel, LAb[au]
LAb[au] featured projects: <a title="Click here to learn more about f5x5x1" href="?tag=f5x5x1" rel="tag-definition" class="tag">Framework f5x5x1</a>, kinetic light art installation
<a title="Click here to learn more about binary waves" href="?tag=binary+waves" rel="tag-definition" class="tag">binary waves</a>, cybernetic light art installation</p>

Update 2: The XPath is used in PHP like this:

    $dom = new DOMDocument('1.0', 'utf8');
    $dom->formatOutput = true;
    $dom->loadHTML(mb_convert_encoding($text, 'HTML-ENTITIES', 'UTF-8'));
    $xpath = new DOMXPath($dom);
    foreach ($tags as $t) {
        $label = $t['label'];
        $nodes = $xpath->query('//text()[contains(., "' . $label . '")]');
        $urlVersion = htmlentities(urlencode($label));

        foreach ($nodes as $node) {
            $link = '<a class="tag" rel="tag-definition" title="Click to know more about ' . $label . '" href="?tag='.$urlVersion.'">'.$label.'</a>';
            $replaced = str_replace($label, $link, $node->textContent);
            $newNode = $dom->createDocumentFragment();
            $newNode->appendChild(new DOMText($replaced));
            $node->parentNode->replaceChild($newNode, $node);
        }
    }

    $text= $dom->saveHTML();

The error occurs because one tag is "les amis de CLEA" and another tag is "CLEA".

Upvotes: 1

Views: 1335

Answers (1)

Shaun McCance

Reputation: 474

That expression should not return attribute values. This looks like a bug in the PHP XPath implementation. In XPath, // is short for /descendant-or-self::node()/. Descendants do not include attributes. Even if they did, text() without an axis is short for child::text(), and attributes do not have child nodes. See http://www.w3.org/TR/xpath/#axes
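
One way to check this on your own installation is to run the query against a small document where the term appears both in an attribute and in element text. The sketch below assumes a made-up one-line sample (not the full document from the question) and PHP's bundled DOM extension. If attributes are excluded as the spec describes, only the text node inside the a element should be listed.

```php
<?php
// Hypothetical sample: "CLEA" appears once in a title attribute
// and once as the text content of the <a> element.
$html = '<p>Hello <a class="tag" title="this is CLEA">CLEA</a>, hello!</p>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Same expression as in the question.
$nodes = $xpath->query('//text()[contains(., "CLEA")]');

$matches = [];
foreach ($nodes as $node) {
    // Record the node class, the parent element name, and the node's text.
    $matches[] = get_class($node)
        . ' in <' . $node->parentNode->nodeName . '>: '
        . $node->nodeValue;
}

print_r($matches);
```

If an entry of class DOMAttr, or any text taken from the title attribute, shows up in the list, that would support the bug theory; if not, the query is behaving as specified on that sample.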

So you need a workaround. The fully expanded expression you're using is /descendant-or-self::node()/child::text()[contains(., "CLEA")]. So let's try tweaking that. Instead of node(), try *, which only matches elements:

/descendant-or-self::*/text()[contains(., "CLEA")]

Or try using the text() node test directly on the descendant-or-self axis:

/descendant-or-self::text()[contains(., "CLEA")]
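
To compare the original expression with the two variants above, something like the following could be used. It is only a sketch: the sample markup is invented for illustration, and the $expressions and $results names are not from the question.

```php
<?php
// Hypothetical sample: the term appears in an attribute, inside the <a>
// element, and in the text that follows it.
$html = '<p>Hello <a class="tag" title="this is CLEA">CLEA</a>, CLEA again!</p>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$expressions = [
    '//text()[contains(., "CLEA")]',                       // original query
    '/descendant-or-self::*/text()[contains(., "CLEA")]',  // elements only, then child text
    '/descendant-or-self::text()[contains(., "CLEA")]',    // text() test on the axis itself
];

$results = [];
foreach ($expressions as $expr) {
    $texts = [];
    foreach ($xpath->query($expr) as $node) {
        // Collect the text of every matched node, in document order.
        $texts[] = $node->nodeValue;
    }
    $results[$expr] = $texts;
}

print_r($results);
```

Per the spec, all three should select the same text nodes, so any difference between the three arrays would point at the XPath engine rather than at the rest of the code.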

Upvotes: 1
