Reputation: 53
I am having problems parsing XML files that contain non-Latin character data. For example, I am trying to parse the following XML:
<PersonFullName>
<PersonCode>
9999999999999
</PersonCode>
<FirstName>
ANDŽĀRS
</FirstName>
<LastName>
DŽANDĒRĒKĀ
</LastName>
</PersonFullName>
When I use the following code:
library(XML)
input <- xmlTreeParse(file = "test.xml", encoding = "UTF-8")
print(input)
I get the following result:
<?xml version="1.0" encoding="UTF-8"?>
<PersonFullNameVSAA>
<PersonCode>9999999999999
</PersonCode>
<FirstName>ANDŽĀRS
</FirstName>
<LastName>DŽANDĒRĒKĀ
</LastName>
</PersonFullNameVSAA>
The XML is correctly encoded in UTF-8. I don't know what else I can do to get the characters in the correct format.
Upvotes: 2
Views: 1358
Reputation: 983
I had the same problem and none of the approaches in the comments worked.
The encoding on my Windows machine is Windows-1252, as can be determined with:
Sys.getlocale("LC_CTYPE")
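As a complementary check (a small sketch using base R only, not something from the original setup), l10n_info() reports whether the session itself runs in a UTF-8 locale. The variable names are just for illustration.

```r
# Inspect the session's character-set settings with base R.
ctype <- Sys.getlocale("LC_CTYPE")          # locale name; the value depends on the machine
info <- l10n_info()                         # list with flags such as MBCS, UTF-8, Latin-1
is_utf8_session <- isTRUE(info[["UTF-8"]])  # TRUE only in a UTF-8 locale

cat("LC_CTYPE:", ctype, "\n")
cat("UTF-8 session:", is_utf8_session, "\n")
```

If the last line reports FALSE, strings without an explicit encoding mark are interpreted in the native code page.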
It seems that the XML package correctly parses the UTF-8, but returns strings with the local encoding, which may then be interpreted in the wrong way by R.
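To illustrate how bytes and encoding marks can disagree, here is a minimal base R sketch. It does not use the XML package and only assumes that the bytes in question really are UTF-8; the names raw_bytes and s are made up for the example.

```r
# "Ž" (U+017D) is encoded in UTF-8 as the two bytes 0xC5 0xBD.
raw_bytes <- as.raw(c(0xC5, 0xBD))

# rawToChar() builds a string without an encoding mark, so R treats
# the bytes as being in the native encoding when printing.
s <- rawToChar(raw_bytes)
mark_before <- Encoding(s)   # "unknown"

# Declaring the encoding changes only the mark, not the bytes.
Encoding(s) <- "UTF-8"
mark_after <- Encoding(s)    # "UTF-8"

same_bytes <- identical(charToRaw(s), raw_bytes)
code_point <- utf8ToInt(s)   # 381, i.e. U+017D
```

If the strings coming back from the parser turn out to hold UTF-8 bytes with the wrong mark, setting Encoding() like this may be enough; if the bytes themselves are in another encoding, a conversion with iconv is needed instead.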
I collect my output in a data frame containing character vectors. The solution for me was to convert the resulting data frame with iconv:
apply(mydf, 2, function(x) iconv(x, from = "UTF-8", to = "Windows-1252"))
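Note that apply() returns a character matrix rather than a data frame. A possible variant that keeps the data frame and touches only character columns is sketched below; the sample data and the name is_chr are made up for illustration, and the target encoding is assumed to be Windows-1252 as above.

```r
# Hypothetical sample data; "\u017D" is the letter Ž stored as UTF-8.
mydf <- data.frame(
  first = "AND\u017D",
  code  = 1L,
  stringsAsFactors = FALSE
)

# Convert only the character columns, column by column,
# and assign back so that mydf stays a data frame.
is_chr <- vapply(mydf, is.character, logical(1))
mydf[is_chr] <- lapply(
  mydf[is_chr],
  function(x) iconv(x, from = "UTF-8", to = "Windows-1252")
)
```

Keep in mind that Windows-1252 contains Ž but not letters such as Ā or Ē, so strings with those characters cannot be represented in it; by default iconv returns NA for elements it cannot convert. On a machine with a Baltic locale, a different target such as Windows-1257 may be the better fit.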
Upvotes: 1