Reputation: 868
My string:
<div class="sect1" id="s9781473910270.i101">
<div class="sect2" id="s9781473910270.i102">
<h1 class="title">1.2 Summations and Products[label*summation]</h1>
<p>text</p>
</div>
</div>
<div class="sect1" id="s9781473910270.i103">
<p>sometext [ref*summation]</p>
</div>
<div class="figure" id="s9781473910270.i220">
<div class="metadata" id="s9781473910270.i221">
</div>
<p>fig1.2 [label*somefigure]</p>
<p>sometext [ref*somefigure]</p>
</div>
Objective: In the string above, [label*string] and [ref*string] are the cross-references. Each [ref*string] must be replaced with an <a> element that has class and href attributes. The href is the id of the div in which the related [label*string] resides, and the class of the <a> is derived from the class of that div. In other words, the <a> element's class and href come from the class name and id of the enclosing div. However, if a div with class="metadata" exists, it must be ignored: its class name and id should not be used.
Expected output:
<div class="sect1" id="s9781473910270.i101">
<div class="sect2" id="s9781473910270.i102">
<h1 class="title">1.2 Summations and Products[label*summation]</h1>
<p>text</p>
</div>
</div>
<div class="sect1" id="s9781473910270.i103">
<p>sometext <a class="section-ref" href="s9781473910270.i102">1.2</a></p>
</div>
<div class="figure" id="s9781473910270.i220">
<div class="metadata" id="s9781473910270.i221">
<p>fig1.2 [label*somefigure]</p>
</div>
<p>sometext <a class="fig-ref" href="s9781473910270.i220">fig 1.2</a></p>
</div>
How can this be done in a simpler way, without using a DOM parser?
My idea is to store each label* string and its id in an array, then loop over the ref* strings and match them against the label* strings. If a string matches, the related id and class replace the ref* string.
So I have tried this regex to get the label* strings and their related id and class name.
Upvotes: 6
Views: 356
Reputation: 89557
This approach uses the HTML structure to retrieve the needed elements with DOMXPath. Regular expressions are then used, in a second step, to extract information from text nodes or attributes:
$classRel = ['sect2'  => 'section-ref',
             'figure' => 'fig-ref'];

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html); // or $dom->loadHTMLFile($url);
$xp = new DOMXPath($dom);

// make a custom php function available for the XPath query
// (it isn't really necessary, but it is more rigorous than writing
// "contains(@class, 'myClass')" )
$xp->registerNamespace("php", "http://php.net/xpath");

function hasClass($classNode, $className) {
    if (!empty($classNode))
        return in_array($className, preg_split('~\s+~', $classNode[0]->value, -1, PREG_SPLIT_NO_EMPTY));
    return false;
}

$xp->registerPHPFunctions('hasClass');

// The XPath query will find the first ancestor of a text node with '[label*'
// that is a div tag with an id and a class attribute,
// if the class attribute doesn't contain the "metadata" class.
$labelQuery = <<<'EOD'
//text()[contains(., 'label*')]
/ancestor::div
[@id and @class and not(php:function('hasClass', @class, 'metadata'))][1]
EOD;

$idNodeList = $xp->query($labelQuery);
$links = [];

// For each div node, a new link node is created in the associative array $links.
// The keys are labels.
foreach ($idNodeList as $divNode) {
    // The pattern extracts the first text part in group 1 and the label in group 2
    if (preg_match('~(\S+) .*? \[label\* ([^]]+) ]~x', $divNode->textContent, $m)) {
        $links[$m[2]] = $dom->createElement('a');
        $links[$m[2]]->setAttribute('href', $divNode->getAttribute('id'));
        $links[$m[2]]->setAttribute('class', $classRel[$divNode->getAttribute('class')]);
        $links[$m[2]]->nodeValue = $m[1];
    }
}

if ($links) { // if $links is empty no need to do anything
    $refNodeList = $xp->query("//text()[contains(., '[ref*')]");

    foreach ($refNodeList as $refNode) {
        // split the text on the square-bracket parts; the reference name is preserved in a capture
        $parts = preg_split('~\[ref\*([^]]+)]~', $refNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);

        // create a fragment to receive text parts and links
        $frag = $dom->createDocumentFragment();

        foreach ($parts as $k => $part) {
            if ($k%2 && isset($links[$part])) { // delimiters are always odd items
                $clone = $links[$part]->cloneNode(true);
                $frag->appendChild($clone);
            } elseif ($part !== '') {
                $frag->appendChild($dom->createTextNode($part));
            }
        }

        $refNode->parentNode->replaceChild($frag, $refNode);
    }
}

$result = '';
$childNodes = $dom->getElementsByTagName('body')->item(0)->childNodes;

foreach ($childNodes as $childNode) {
    $result .= $dom->saveXML($childNode);
}

echo $result;
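The code above relies on the captured reference names always landing at odd indexes of the array returned by preg_split. That can be checked in isolation with a minimal sketch like the following. The sample strings and the variable names $text, $parts and $leading are made up for illustration; only the pattern is taken from the code above.

```php
<?php
// Same pattern as in the answer: the reference name is captured in group 1.
$pattern = '~\[ref\*([^]]+)]~';

// Illustrative input (not from the question): two references inside plain text.
$text  = 'see [ref*summation] and [ref*somefigure] here';
$parts = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

// Even indexes hold the surrounding text, odd indexes hold the captured names.
foreach ($parts as $k => $part) {
    echo $k, ($k % 2 ? ' ref:  ' : ' text: '), $part, "\n";
}

// When the text starts (or ends) with a reference, preg_split returns an
// empty string at that position, so the odd/even alignment is preserved.
// This is why the answer skips parts that are === ''.
$leading = preg_split($pattern, '[ref*summation]', -1, PREG_SPLIT_DELIM_CAPTURE);
var_dump($leading);
```

Note that adding PREG_SPLIT_NO_EMPTY here would drop those empty strings and shift the indexes, so the $k % 2 test would no longer be reliable.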
Upvotes: 2
Reputation: 2507
This is not a task for regular expressions. Regular expressions are (usually) for regular languages, and what you want to do is work on a context-sensitive language (referencing an identifier that has been declared before).
So you should definitely go with a DOM parser. The algorithm for this would be very easy, because you can operate on one node and its children.
So the theoretical answer to your question is: you can't. Though it might work out, in some crappy way, with the many regex extensions.
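As a rough illustration of how little node-level work is involved, here is a minimal sketch that starts from the node holding a label and climbs to the nearest enclosing div that has an id and is not a "metadata" div. The helper name nearestLabelDiv and the sample markup (with shortened ids) are assumptions made for this example, not part of the question.

```php
<?php
// Hypothetical helper: walk up from $node to the closest <div> ancestor
// that carries an id and does not have the "metadata" class.
function nearestLabelDiv(DOMNode $node): ?DOMElement
{
    for ($n = $node->parentNode; $n !== null; $n = $n->parentNode) {
        if (!($n instanceof DOMElement) || $n->tagName !== 'div' || !$n->hasAttribute('id')) {
            continue;
        }
        // Compare whole class tokens, so e.g. "metadata-x" is not mistaken for "metadata".
        $classes = preg_split('~\s+~', $n->getAttribute('class'), -1, PREG_SPLIT_NO_EMPTY);
        if (!in_array('metadata', $classes, true)) {
            return $n;
        }
    }
    return null; // no suitable div found
}

// Illustrative markup: the label sits inside a metadata div, which must be skipped.
$html = '<div class="figure" id="f1">'
      . '<div class="metadata" id="m1"><p>fig1.2 [label*somefigure]</p></div>'
      . '</div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);

$p   = $dom->getElementsByTagName('p')->item(0);
$div = nearestLabelDiv($p);

echo $div->getAttribute('id'), ' ', $div->getAttribute('class'), "\n"; // prints "f1 figure"
```

The id and class returned this way are what a label-to-link table would store before the ref* markers are replaced.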
Upvotes: -1