Reputation: 11
From a string that contains a TEI file, I generate an index for navigating its blocks. I retrieve all the div tags, and I also want to get, if present, the content of the <head> tag inside the current div.
Example TEI file:
<div type="lib" n="1"><head>LIBER I</head>...
<div type="pr">...</div>
<div type="cap" n="1"><head>CAP EX</head><p><milestone unit="par" n="1" />...<milestone unit="par" n="2" />...</div>
<div type="cap" n="2"><head>CAP EX</head><milestone unit="par" n="1" />...<milestone unit="par" n="2" />...</div>
</div>
I tried this, but it doesn't work:
//source file:
$fulltext = '<div type="lib" n="1"><head>LIBER I</head>...<div type="pr">...</div><div type="cap" n="1"><head>CAP EX</head><p><milestone unit="par" n="1" />...<milestone unit="par" n="2" />...</div><div type="cap" n="2"><head>CAP EX</head><milestone unit="par" n="1" />...<milestone unit="par" n="2" />...</div></div>';
$dom = new DOMDocument();
@$dom->loadHTML($fulltext);
$domx = new DOMXPath($dom);
$entries = $domx->evaluate("//div");
echo '<ul>';
foreach ($entries as $entry){
$title = '';
$type = $entry->getAttribute( 'type' );
$n = $entry->getAttribute( 'n' );
$head = $domx->evaluate("string(./head[1])",$entry);
if( $head != '' ) $title = $head; else $title = $n;
echo '<li><a href="#'.$type.'-'.$n.'">'.$title.'</a></li>';
}
echo '</ul>';
This is the line that doesn't work:
$head = $domx->evaluate("string(./head[1])",$entry);
Error returned:
DOMDocument::loadHTML(): htmlParseStartTag: misplaced <head> tag in Entity, line: 3
The purpose of this line is to get the text of the child <head> tag inside the loop (in this example, "LIBER I").
Upvotes: 0
Views: 77
Reputation: 11
Resolved using XMLReader:
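$level = 0;
$indici_bc = array();
$indici_head = array();
$passed_milestone = false;
// initialized up front so the first iteration does not read undefined variables
$last_blocco = '';
$bc = array();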
$xml = new XMLReader();
$xml->open($pathTei);
//$xml->xml($testo);
while ($xml->read()){
if($xml->nodeType == XMLReader::END_ELEMENT && $xml->name == 'div'){
$level--;
$last_blocco = $xml->name;
if($passed_milestone){ $level--; $passed_milestone = false; }
}
if($xml->nodeType == XMLReader::ELEMENT && ($xml->name == 'div' || $xml->name == 'milestone' )){
$blocco = $xml->name;
$type = $xml->getAttribute('type');
$n = $xml->getAttribute('n');
// getAttribute() returns null when the attribute is missing; isset() cannot wrap a method call
$unit = $xml->getAttribute('unit') ?? '';
//here I get the child node
$node = new SimpleXMLElement($xml->readOuterXML());
$head = $node->head ? (string)$node->head : '';
$indici_head[] = $head;
if($last_blocco != 'milestone') $level++;
if($blocco == 'div') $bc[$level] = $n; else $bc[($level+1)] = $n;
$bc_str = '';
for($j=1;$j<$level;$j++){
if( $bc_str != '' ) $bc_str.='.';
$bc_str.=$bc[$j];
}
if( $bc_str != '' ) $bc_str.='.';
$bc_str.=$n;
$last_blocco = $xml->name;
if( $blocco == 'milestone' ) $passed_milestone = true;
$indici_bc[]=$bc_str;
}
}
$xml->close();
Upvotes: 0
Reputation: 57121
Using the @ symbol on the load can hide all sorts of issues. If you take it out, you will see the errors raised for your document.
If, however, you change the line to
$dom->loadXML($fulltext);
the output gives you what you're after.
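For illustration, here is a minimal sketch of the whole loop with that change applied. It rests on a few assumptions that are not in the question: the sample string is the question's snippet with the "..." elisions replaced by placeholder text and the unclosed <p> dropped so that it is well-formed XML, the list items are collected in an `$items` array (a name introduced here), and `libxml_use_internal_errors()` is used to report parser errors rather than hiding them with @.

```php
<?php
// Assumed sample: the question's snippet, made well-formed
// (elisions replaced by "text", stray <p> removed).
$fulltext = '<div type="lib" n="1"><head>LIBER I</head>'
    . '<div type="pr">text</div>'
    . '<div type="cap" n="1"><head>CAP EX</head><milestone unit="par" n="1" />text</div>'
    . '<div type="cap" n="2"><head>CAP EX</head><milestone unit="par" n="1" />text</div>'
    . '</div>';

// Collect parser errors so they can be printed instead of suppressed with @.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
if (!$dom->loadXML($fulltext)) {
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message), ' (line ', $error->line, ")\n";
    }
    exit(1);
}

$domx = new DOMXPath($dom);
$items = [];
foreach ($domx->query('//div') as $entry) {
    $type = $entry->getAttribute('type');
    $n = $entry->getAttribute('n');
    // string() yields '' when the div has no direct <head> child.
    $head = $domx->evaluate('string(./head[1])', $entry);
    $title = $head !== '' ? $head : $n;
    $items[] = '<li><a href="#' . $type . '-' . $n . '">'
        . htmlspecialchars($title) . '</a></li>';
}

echo "<ul>\n" . implode("\n", $items) . "\n</ul>\n";
```

If the real file declares the TEI namespace on its root element, the unprefixed `//div` and `./head[1]` expressions would likely match nothing; in that case a prefix would have to be registered with `DOMXPath::registerNamespace()` and used in both expressions.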
Upvotes: 0