How to get ID using a specific word in regex?

Question

My string:

       

1.2 Summations and Products[label*summation]
text 

           

sometext [ref*summation]





fig1.2 [label*somefigure]
sometext [ref*somefigure]

Objective: In the string above, label*string and ref*string are cross-references. Each [ref*string] should be replaced with an a element that has class and href attributes: the href is the id of the div in which the matching label* resides, and the class of the a element is the class of that div.

As mentioned above, the a element takes its class and href from the class name and id of the enclosing div. However, a div with class="metadata" must be ignored: its class name and id should not be used.

Expected output:

       

1.2 Summations and Products[label*summation]
text 

             

sometext 1.2





fig1.2 [label*somefigure]

sometext fig 1.2

How can I do this in a simpler way, without using a DOM parser?

My idea is to store each label* string together with its id and class in an array, then loop over the ref* strings: when a ref* string matches a label* string, the [ref*string] is replaced using the related id and class. So I have tried this regex to get the label* string and its related id and class name.
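The replacement step of that idea can be sketched as follows. This is only an illustration under assumptions, not the regex from the question: the $labels map, its ids (sec-1-2, fig-1-2) and the replaceRefs() helper are hypothetical, and building the map from the markup is left out.

```php
<?php
// Hypothetical label map; in practice it would be built from the markup.
// The ids and class names below are illustrative assumptions.
$labels = [
    'summation'  => ['id' => 'sec-1-2', 'class' => 'sect2',  'text' => '1.2'],
    'somefigure' => ['id' => 'fig-1-2', 'class' => 'figure', 'text' => 'fig1.2'],
];

// Replace every [ref*name] marker with a link built from the label map.
function replaceRefs(string $text, array $labels): string
{
    return preg_replace_callback(
        '~\[ref\*([^]]+)]~',
        function (array $m) use ($labels): string {
            if (!isset($labels[$m[1]])) {
                return $m[0]; // unknown label: leave the marker untouched
            }
            $label = $labels[$m[1]];
            return sprintf(
                '<a class="%s" href="%s">%s</a>',
                htmlspecialchars($label['class']),
                htmlspecialchars($label['id']),
                htmlspecialchars($label['text'])
            );
        },
        $text
    );
}

echo replaceRefs('sometext [ref*summation]', $labels), "\n";
// prints: sometext <a class="sect2" href="sec-1-2">1.2</a>
```

A marker whose label is not in the map is left unchanged, so a missing label stays visible in the output instead of silently disappearing.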

Casimir et Hippolyte · Accepted Answer

This approach uses the HTML structure to retrieve the needed elements with DOMXPath. Regexes are then used, in a second step, to extract information from text nodes or attributes:

$classRel = ['sect2'  => 'section-ref',
             'figure' => 'fig-ref'];

libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($html); // or $dom->loadHTMLFile($url); 

$xp = new DOMXPath($dom);

// make a custom php function available for the XPath query
// (it isn't really necessary, but it is more rigorous than writing
// "contains(@class, 'myClass')" )
$xp->registerNamespace("php", "http://php.net/xpath");

function hasClass($classNode, $className) {
    if (!empty($classNode))
        return in_array($className, preg_split('~\s+~', $classNode[0]->value, -1, PREG_SPLIT_NO_EMPTY));
    return false;
}

$xp->registerPHPFunctions('hasClass');


// The XPath query will find the first ancestor of a text node with '[label*'
// that is a div tag with an id and a class attribute,
// if the class attribute doesn't contain the "metadata" class.

$labelQuery = <<<'EOD'
//text()[contains(., 'label*')]
/ancestor::div
[@id and @class and not(php:function('hasClass', @class, 'metadata'))][1]
EOD;

$idNodeList = $xp->query($labelQuery);

$links = [];

// For each div node, a new link node is created in the associative array $links.
// The keys are labels. 
foreach($idNodeList as $divNode) {

    // The pattern extracts the first text part in group 1 and the label in group 2
    if (preg_match('~(\S+) .*? \[label\* ([^]]+) ]~x', $divNode->textContent, $m)) {
        $links[$m[2]] = $dom->createElement('a');
        $links[$m[2]]->setAttribute('href', $divNode->getAttribute('id'));
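        // fall back to the div's own class when it has no entry in $classRel
        $divClass = $divNode->getAttribute('class');
        $links[$m[2]]->setAttribute('class', $classRel[$divClass] ?? $divClass);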
        $links[$m[2]]->nodeValue = $m[1];
    }
}


if ($links) { // if $links is empty no need to do anything

    $refNodeList = $xp->query("//text()[contains(., '[ref*')]");

    foreach ($refNodeList as $refNode) {
        // split the text on the square-bracket parts; each reference name is kept as a captured delimiter
        $parts = preg_split('~\[ref\*([^]]+)]~', $refNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);

        // create a fragment to receive text parts and links
        $frag = $dom->createDocumentFragment();

        foreach ($parts as $k=>$part) {
            if ($k%2 && isset($links[$part])) { // delimiters are always odd items
                $clone = $links[$part]->cloneNode(true);
                $frag->appendChild($clone);
            } elseif ($part !== '') {
                $frag->appendChild($dom->createTextNode($part));
            }
        }

        $refNode->parentNode->replaceChild($frag, $refNode);
    }
}

$result = '';

$childNodes = $dom->getElementsByTagName('body')->item(0)->childNodes;

foreach ($childNodes as $childNode) {
    $result .= $dom->saveXML($childNode);
}

echo $result;
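The comment about hasClass() being more rigorous than contains(@class, 'myClass') can be illustrated with a small comparison. This is a minimal sketch: hasClassToken() and containsSubstring() are hypothetical helper names, and the class attribute value is made up for the example.

```php
<?php
// Token-based test, mirroring the preg_split call used in hasClass() above.
function hasClassToken(string $classAttr, string $className): bool
{
    $tokens = preg_split('~\s+~', $classAttr, -1, PREG_SPLIT_NO_EMPTY);
    return in_array($className, $tokens, true);
}

// Substring test, roughly what contains(@class, '...') does in XPath.
function containsSubstring(string $classAttr, string $needle): bool
{
    return strpos($classAttr, $needle) !== false;
}

$class = 'sect2 metadata-extra'; // hypothetical class attribute

var_dump(hasClassToken($class, 'metadata'));            // bool(false)
var_dump(containsSubstring($class, 'metadata'));        // bool(true)
var_dump(hasClassToken(" metadata\tbox ", 'metadata')); // bool(true)
```

The substring test reports a match for "metadata-extra", which would wrongly exclude that div, while the token test only matches a whole class name.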
