ba3a

Reputation: 340

SimpleXML outputs unicode in a strange way

I use SimpleXML to process an XML file that contains Cyrillic characters. I also use dom_import_simplexml, importNode and appendChild to copy trees from file to file and from place to place. At the end of processing I call print_r on the resulting SimpleXMLElement and everything looks fine. But I also call asXml('outputfile.xml'), and there something strange happens: all Cyrillic characters that were not wrapped in CDATA (some tag bodies and all attributes) are replaced with their numeric Unicode character references.

For example, the output of print_r (just a fragment):

SimpleXMLElement Object ( [@attributes] => Array 
             ( [NAME] => Государственный аппарат и     механизм 
               [COSTYES] => 3.89983579639 [COSTNO] => 0 
               [ID] => 9 )
           [COMMENTYES] => Вы совершенно         правы. 
          [COMMENTNO] => Нет, Вы ошиблись. ) ) )

But in the file that asXml generates, I get something like this:

<QUEST NAME="&#x422;&#x435;&#x43E;&#x440;&#x438;&#x44F;&#x434;&#x432;&#x443;&#x445;&#x43C;&#x435;&#x447;&#x435;&#x439;"
    style="educ" ID="1">
  <DESC><![CDATA[Теория происхождения государства, известная как теория "двух мечей" [2, с.40], 
    представляет из себя...
  ]]></DESC>

I set the UTF-8 locale everywhere I could and googled every combination of the words "simplexml, unicode, cyrillic, asXml", etc., but nothing worked.

UPD: It looks like one of the functions involved does something similar to htmlentities(). So, thanks to voitcus, the solution is to use html_entity_decode(), as advised here.
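One caveat with running html_entity_decode() over a whole XML string is that it may also decode references such as &amp;amp; and &amp;lt;, which can leave the XML malformed. As a hedged alternative, here is a minimal sketch that decodes only non-ASCII hexadecimal character references. The helper name decodeNonAsciiRefs is hypothetical (it is not part of PHP or SimpleXML), and the sketch assumes the mbstring extension is available for mb_chr().

```php
<?php
// Hypothetical helper, not part of PHP or SimpleXML: turns hexadecimal
// character references above U+007F back into UTF-8 characters.
function decodeNonAsciiRefs(string $xml): string
{
    return preg_replace_callback(
        '/&#x([0-9A-Fa-f]{1,6});/',
        static function (array $m): string {
            $codePoint = hexdec($m[1]);
            if ($codePoint <= 0x7F) {
                // Keep ASCII references (e.g. &#x26; for "&") untouched.
                return $m[0];
            }
            $char = mb_chr($codePoint, 'UTF-8');
            // mb_chr() returns false for invalid code points; keep the reference then.
            return $char === false ? $m[0] : $char;
        },
        $xml
    );
}

$raw = '<QUEST NAME="&#x422;&#x435;&#x43E;&#x440;&#x438;&#x44F; &amp; &#x26;"/>';
echo decodeNonAsciiRefs($raw), "\n";
```

Note that this sketch works on the raw string, so it would also rewrite text inside CDATA sections that merely looks like a character reference.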

Upvotes: 0

Views: 2518

Answers (1)

akky

Reputation: 2907

I wonder whether you declared the encoding when you first imported the XML document. The following two snippets give different output.

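// No XML declaration: asXml() writes the Cyrillic text as numeric character references.
$simplexml = simplexml_load_string('<QUEST NAME="Государственный" />');
if (!$simplexml) { exit('parse failed'); }
print_r($simplexml->asXml());

// Encoding declared as UTF-8: asXml() keeps the Cyrillic characters as they are.
$simplexml = simplexml_load_string('<?xml version="1.0" encoding="UTF-8"?><QUEST NAME="Государственный" />');
if (!$simplexml) { exit('parse failed'); }
print_r($simplexml->asXml());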

A SimpleXMLElement object knows its own encoding from the original XML declaration. If no encoding was declared, it generates numeric character references instead, to be on the safe side, I guess.
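If the source files cannot be changed to carry a declaration, a possible workaround is to set the encoding on the underlying document after loading. This is only a sketch: it relies on dom_import_simplexml() (which the question already uses) exposing the same document that SimpleXML serializes, and it assumes UTF-8 is the encoding you want in the output. The variable names are made up for the example.

```php
<?php
// No XML declaration, so the parsed document carries no encoding.
$sx = simplexml_load_string('<QUEST NAME="Государственный" />');
if ($sx === false) {
    exit('parse failed');
}

// Serialized before any change: non-ASCII text is expected as &#x...; references.
$before = $sx->asXML();

// dom_import_simplexml() returns a DOMElement backed by the same document.
$doc = dom_import_simplexml($sx)->ownerDocument;
$doc->encoding = 'UTF-8'; // assumption: UTF-8 is the desired output encoding

// Serialized again: the Cyrillic text is expected as raw UTF-8.
$after = $sx->asXML();

echo $before, "\n", $after, "\n";
```

The same idea should carry over to asXml('outputfile.xml'), since it serializes the same document, but I have only reasoned about the string form here.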

Upvotes: 3
