MiksL

Reputation: 53

Encoding issue when parsing XML in R

I am having problems parsing XML files that contain non-Latin characters. For example, I am trying to parse the following XML:

<PersonFullName>
  <PersonCode>
    9999999999999
  </PersonCode>
  <FirstName>
    ANDŽĀRS
  </FirstName>
  <LastName>
    DŽANDĒRĒKĀ
  </LastName>
</PersonFullName>

When I use the following code:

library(XML)
input <- xmlTreeParse(file = "test.xml", encoding = "UTF-8")
print(input)

I get the following result:

<?xml version="1.0" encoding="UTF-8"?>
<PersonFullNameVSAA>
  <PersonCode>9999999999999
                </PersonCode>
  <FirstName>ANDŽĀRS
                </FirstName>
  <LastName>DŽANDĒRĒKĀ
                </LastName>
</PersonFullNameVSAA>

The XML file is correctly encoded in UTF-8. I don't know what else I can do to get the characters in the correct format.

Upvotes: 2

Answers (1)

esel

Reputation: 983

I had the same problem and none of the approaches in the comments worked.

The encoding on my Windows machine is Windows-1252, which can be checked with:

Sys.getlocale("LC_CTYPE")

It seems that the XML package parses the UTF-8 correctly but returns strings that R then treats as being in the local encoding, so the bytes are interpreted the wrong way.
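
To make that mechanism concrete, here is a minimal sketch of what happens when UTF-8 bytes are read as Windows-1252. It does not use the XML package at all; the byte values and the variable names are illustrative assumptions, chosen because "Ž" appears in the question's data.

```r
# UTF-8 encodes "Ž" (U+017D) as the two bytes C5 BD.
utf8_bytes <- as.raw(c(0xC5, 0xBD))
raw_string <- rawToChar(utf8_bytes)  # same bytes, no encoding declared

# Reading those bytes as Windows-1252 treats each byte as its own character:
# C5 is "Å" and BD is "½", so one letter turns into two wrong ones.
misread <- iconv(raw_string, from = "Windows-1252", to = "UTF-8")

# Reading the same bytes as UTF-8 yields the single intended code point.
intended <- utf8ToInt(raw_string)  # 381, i.e. U+017D

print(misread)
print(intended)
```

If the parsed strings show pairs of accented Latin characters where one letter was expected, this kind of misreading is a likely cause.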

I collect my output in a data frame containing character vectors. The solution for me was to translate the resulting data frame with iconv:

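# Convert each column; assigning into mydf[] keeps a data frame (apply() would return a matrix)
mydf[] <- lapply(mydf, function(x) iconv(x, from = "UTF-8", to = "Windows-1252"))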
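
A different technique that may be worth trying, assuming the strings really hold UTF-8 bytes that R has simply not marked as UTF-8, is to declare the encoding instead of converting it. This is only a sketch: the one-column `mydf` below is a hypothetical stand-in for the real data frame, and `mark_utf8` is a helper name invented for illustration.

```r
# Hypothetical stand-in for the parsed data: the UTF-8 bytes of "Ž",
# stored without any encoding declaration.
mydf <- data.frame(
  FirstName = rawToChar(as.raw(c(0xC5, 0xBD))),
  stringsAsFactors = FALSE
)

# Illustrative helper: declare character columns as UTF-8.
# Encoding<- only changes the declared encoding; the bytes stay the same.
mark_utf8 <- function(x) {
  if (is.character(x)) Encoding(x) <- "UTF-8"
  x
}

# Assigning into mydf[] keeps the result a data frame.
mydf[] <- lapply(mydf, mark_utf8)

print(Encoding(mydf$FirstName))
```

Unlike converting to Windows-1252, this does not depend on every character having an equivalent in the local code page.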

Upvotes: 1
