Reputation: 4671
I'm trying to parse the following XML file:
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE content PUBLIC "-//BLACKWELL PUBLISHING GROUP//DTD 4.0//EN" "http://www.blackwellpublishing.com/xml/dtds/4-0/bpg4-0.dtd">
<content dtdver="4.0" docfmt="xml">
....
<forenames>NIELS B&Oslash;IE</forenames><x> </x>
At first it wouldn't load, but now I have code that at least seems to use the DTD to resolve entities like &Oslash; (Ø). The next problem is that the character doesn't show up in the output.
This is my parsing code:
$options = LIBXML_DTDLOAD | LIBXML_NOENT | LIBXML_DTDVALID | LIBXML_NOCDATA;
$doc = simplexml_load_string ( $xml,null,$options );
echo $doc->document->header->namegroup->name->forenames."\n";
This is the output:
NIELS BIE
I tried it with DOM XML parsing too, and then the output was NIELS B IE (so with a space where the character should be).
Any ideas?
Upvotes: 1
Views: 878
Reputation: 546253
Looking at the DTD, it says this (but without line breaks):
<!ENTITY Oslash
"<symbol name='Oslash' unicode='00D8'
type='html' glyph='@Oslash;' description='capital O, slash'
ascii='O' > </symbol>"
>
To any XML reader using this DTD, this means "Whenever you see this exact combination of characters in the source: &Oslash;, replace it with this text: <symbol name='Oslash' unicode... > </symbol>".
This means that the XML data actually reads like this:
<forenames>NIELS B<symbol name='Oslash' unicode='00D8'
type='html' glyph='@Oslash;' description='capital O, slash'
ascii='O' > </symbol>IE</forenames>
...which explains why it's not showing up in your browser. The way around it would be to search your XML document for all <symbol> elements, read the unicode attribute of each one, and replace the element with the character that attribute names.
Looking further at it, the comments at the top of the DTD show they've considered people in your situation! The glyph attribute on the <symbol> tag is the standard HTML entity to use for that symbol, but with the ampersand replaced with an @.
10 read xml document
20 search for any <symbol> element
30 read the "glyph" attribute
40 remove the <symbol> element
50 replace the @ with an & in glyph
60 write that in the place of <symbol>
70 goto 20
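The steps above could be sketched in PHP with the DOM extension roughly as follows. This is only a sketch: replaceSymbols is a hypothetical helper name, the sample XML is a shortened version of the expanded entity shown earlier, and it assumes every glyph value is an HTML entity with the leading & written as @.

```php
<?php
// Hypothetical helper: swaps every <symbol> element for the character
// named by its glyph attribute (e.g. '@Oslash;' -> '&Oslash;' -> 'Ø').
function replaceSymbols(DOMDocument $doc): void
{
    // getElementsByTagName() returns a live list, so copy it to an array
    // before changing the tree.
    $symbols = iterator_to_array($doc->getElementsByTagName('symbol'), false);
    foreach ($symbols as $symbol) {
        // Turn the DTD's '@' placeholder back into an ampersand.
        $entity = str_replace('@', '&', $symbol->getAttribute('glyph'));
        // Decode the HTML entity into a UTF-8 character.
        $char = html_entity_decode($entity, ENT_QUOTES | ENT_HTML5, 'UTF-8');
        // Put a plain text node where the <symbol> element used to be.
        $symbol->parentNode->replaceChild($doc->createTextNode($char), $symbol);
    }
}

// Shortened sample of what the parser sees after entity expansion.
$xml = "<forenames>NIELS B<symbol name='Oslash' unicode='00D8' "
     . "glyph='@Oslash;' ascii='O'> </symbol>IE</forenames>";

$doc = new DOMDocument();
$doc->loadXML($xml);
replaceSymbols($doc);

$forenames = $doc->documentElement->textContent;
echo $forenames, "\n";
```

In the real document you would load the file with the same libxml options as in the question, so that the entities are expanded into <symbol> elements before the helper runs.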
Upvotes: 3
Reputation: 4671
OK, I got a bit further. If I use var_dump instead of echo, I get this:
object(SimpleXMLElement)[22]
public 'symbol' =>
object(SimpleXMLElement)[21]
public '@attributes' =>
array
'name' => string 'Oslash' (length=6)
'unicode' => string '00D8' (length=4)
'type' => string 'html' (length=4)
'glyph' => string '@Oslash;' (length=8)
'description' => string 'capital O, slash' (length=16)
'ascii' => string 'O' (length=1)
string ' ' (length=1)
I wonder how I can use that to build a complete string together with the rest of the contents of forenames.
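One way to get a single string out of that structure might be to walk the element's child nodes in document order and substitute each symbol as you go. The sketch below makes a few assumptions: flattenText is a hypothetical helper name, the unicode attribute is taken to be a hexadecimal code point (00D8 is U+00D8, Ø), and only direct children of the element are handled.

```php
<?php
// Hypothetical helper: flattens an element's mixed content (text plus
// <symbol> children) into one UTF-8 string.
function flattenText(SimpleXMLElement $element): string
{
    $text = '';
    // SimpleXML hides the order of text and child elements, so switch to DOM.
    foreach (dom_import_simplexml($element)->childNodes as $node) {
        if ($node instanceof DOMElement && $node->nodeName === 'symbol') {
            // Assumption: unicode='00D8' is a hex code point. Build a numeric
            // character reference and let html_entity_decode() turn it into UTF-8.
            $reference = '&#x' . $node->getAttribute('unicode') . ';';
            $text .= html_entity_decode($reference, ENT_QUOTES | ENT_HTML5, 'UTF-8');
        } else {
            // Plain text nodes (and any other elements) contribute their text.
            $text .= $node->textContent;
        }
    }
    return $text;
}

// Shortened sample standing in for $doc->document->header->namegroup->name->forenames.
$forenames = simplexml_load_string(
    "<forenames>NIELS B<symbol unicode='00D8' glyph='@Oslash;'> </symbol>IE</forenames>"
);

$fullName = flattenText($forenames);
echo $fullName, "\n";
```

This also explains the earlier results: echo on the SimpleXMLElement only concatenates the element's own text (NIELS BIE), while the DOM text content includes the single space inside the <symbol> element (NIELS B IE).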
Upvotes: 0
Reputation: 14603
If you have the correct encoding, you don't need to escape &Oslash; (Ø). Try using Unicode to be sure.
If there is no way to change the behavior, try unescaping the HTML entities; check the PHP manual.
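Unescaping HTML entities presumably refers to PHP's html_entity_decode. A minimal sketch on a plain string (the sample value is taken from the question; the variable names are illustrative):

```php
<?php
// A string that still contains the raw HTML entity reference.
$raw = 'NIELS B&Oslash;IE';

// Decode named HTML entities into UTF-8 characters.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, "\n";
```

Note that running this over a whole XML document before parsing would also decode references such as &amp; and &lt;, which can make the markup ill-formed, so it is probably safer on extracted values than on the raw file.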
Upvotes: 1
Reputation: 29019
The DTD you are using with your XML file there doesn't contain the Oslash entity. As such, the XML parser simply doesn't know what to do with &Oslash;, and confusion and/or hilarity ensues.
It is important to separate HTML's notion of named entities (of which Oslash is part) from XML's notion of named entities (apos, lt, gt, quot, amp). Basically, if it's not HTML, there's no Oslash (at least in the general case; some DTDs may have it, but it might not be the character you want at all).
In other words: always use UTF-8. Always.
EDIT: Ø is in latin-1, too.
Upvotes: 2