Reputation: 340
I use SimpleXML to process an XML file that contains Cyrillic characters. I also use dom_import_simplexml, importNode and appendChild to copy trees from file to file and from place to place.
At the end of processing I call print_r on the resulting SimpleXMLElement and everything is OK. But I also call asXml('outputfile.xml'), and something strange happens: all Cyrillic characters that were not wrapped in CDATA (some tag bodies and all attributes) are replaced with their Unicode code points.
For example, the output of print_r (just a fragment):
SimpleXMLElement Object ( [@attributes] => Array
( [NAME] => Государственный аппарат и механизм
[COSTYES] => 3.89983579639 [COSTNO] => 0
[ID] => 9 )
[COMMENTYES] => Вы совершенно правы.
[COMMENTNO] => Нет, Вы ошиблись. ) ) )
But in the file that asXml generates, I get something like this:
<QUEST NAME="&#x422;&#x435;&#x43E;&#x440;&#x438;&#x44F; &#x434;&#x432;&#x443;&#x445; &#x43C;&#x435;&#x447;&#x435;&#x439;"
style="educ" ID="1">
<DESC><![CDATA[Теория происхождения государства, известная как теория "двух мечей" [2, с.40],
представляет из себя...
]]></DESC>
I set the UTF-8 locale everywhere possible and googled every combination of the words "simplexml, unicode, cyrillic, asXml, etc.", but nothing worked.
UPD: It looks like one of the functions involved does the equivalent of htmlentities(). So, thanks to voitcus, the solution is to use html_entity_decode() as advised here.
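A minimal sketch of that workaround, assuming the serialized XML is held in a string before it is written out. The sample element is made up for illustration, and the flags passed to html_entity_decode() are my own choice, not something stated above.

```php
<?php
// Hypothetical sample input; the real document comes from a file.
$simplexml = simplexml_load_string('<QUEST NAME="Государственный" />');
if (!$simplexml) { exit('parse failed'); }

// Serialize to a string instead of writing the file directly.
$xml = $simplexml->asXML();

// Turn numeric character references such as &#x413; back into UTF-8 characters.
// ENT_NOQUOTES leaves &quot; and &#039; untouched so attribute values stay quoted safely.
$decoded = html_entity_decode($xml, ENT_NOQUOTES, 'UTF-8');

echo $decoded;
```

One caveat: html_entity_decode() also decodes &amp;amp; and &amp;lt;, so if the data contains escaped markup characters the result may no longer be well-formed XML.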
Upvotes: 0
Views: 2518
Reputation: 2907
I wonder whether you omitted the encoding declaration when you first imported the XML document. The following two snippets give you different output.
// 1) No XML declaration: non-ASCII characters come out as numeric character references.
$simplexml = simplexml_load_string('<QUEST NAME="Государственный" />');
if (!$simplexml) { exit('parse failed'); }
print_r($simplexml->asXml());
// 2) Encoding declared as UTF-8: the Cyrillic text is written as-is.
$simplexml = simplexml_load_string('<?xml version="1.0" encoding="UTF-8"?><QUEST NAME="Государственный" />');
if (!$simplexml) { exit('parse failed'); }
print_r($simplexml->asXml());
A SimpleXMLElement object knows its encoding from the original XML declaration. If none was declared, it generates numeric character references instead of the raw characters, for safety, I guess.
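If the source file cannot be changed to carry a declaration, one possible approach is to set the encoding on the underlying DOM document before saving. This is only a sketch: it reuses the undeclared input from the first snippet above, and going through dom_import_simplexml() and the DOMDocument encoding property is my assumption about a suitable route, not something tested against the asker's files.

```php
<?php
// Same undeclared input as in the first snippet above.
$simplexml = simplexml_load_string('<QUEST NAME="Государственный" />');
if (!$simplexml) { exit('parse failed'); }

// dom_import_simplexml() returns a DOMElement; ownerDocument is its DOMDocument.
$dom = dom_import_simplexml($simplexml)->ownerDocument;

// Declare the output encoding explicitly so the serializer writes UTF-8 text.
$dom->encoding = 'UTF-8';

$xml = $dom->saveXML();
echo $xml;
```

To write the result to disk, $dom->save('outputfile.xml') can be used in place of saveXML().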
Upvotes: 3