jlesuffleur

Reputation: 1253

Encoding lost when reading XML in R

I am retrieving online XML data using the XML R package. My issue is that the UTF-8 encoding is lost during the call to xmlToList: for instance, 'é' is replaced by 'Ã©'. This happens during the XML parsing.
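To make the symptom concrete, here is a minimal base-R sketch of how this kind of garbling can arise: the two UTF-8 bytes of 'é' get read as two Windows-1252 characters. The variable names are illustrative only, and this is an assumption about the mechanism, not a confirmed diagnosis of what the XML package does internally.

```r
# Illustration only: UTF-8 bytes misread as Windows-1252 (CP1252).
original   <- "\u00e9"                         # the character 'é'
utf8_bytes <- charToRaw(enc2utf8(original))    # its two UTF-8 bytes: c3 a9

# Treat each byte as one CP1252 character and convert the result to UTF-8.
# This yields two characters, U+00C3 and U+00A9, instead of one.
garbled <- iconv(rawToChar(utf8_bytes), from = "CP1252", to = "UTF-8")
print(garbled)

# Undo the mistake: go back to the raw bytes, then declare them as UTF-8.
repaired <- iconv(garbled, from = "UTF-8", to = "CP1252")
Encoding(repaired) <- "UTF-8"
print(repaired)
```

If the garbled strings in the parsed list follow this pattern, the last two lines suggest a possible after-the-fact repair, although fixing the encoding at parse time would be cleaner.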

Here is a code snippet with one example where the encoding is lost and another where it is kept (depending on the data source):

library(XML)
library(RCurl)

url = "http://www.bdm.insee.fr/series/sdmx/data/DEFAILLANCES-ENT-FR-ACT/M.AZ+BE.BRUT+CVS-CJO?lastNObservations=2"
res <- getURL(url)
xmlToList(res)
# encoding lost

url2 = "http://www.bdm.insee.fr/series/sdmx/conceptscheme/"
res2 <- getURL(url2)
xmlToList(res2)
# encoding kept

Why does the encoding behaviour differ between the two? I tried setting .encoding = "UTF-8" in getURL, and calling enc2utf8(res), but neither makes any difference.

Any help is welcome!

Thanks,

Jérémy

R version 3.2.1 (2015-06-18)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252   
[3] LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
[5] LC_TIME=French_France.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RCurl_1.95-4.7 bitops_1.0-6   XML_3.98-1.3  

loaded via a namespace (and not attached):
[1] tools_3.2.1

Upvotes: 2

Views: 623

Answers (1)

eblondel

Reputation: 603

You are trying to read SDMX documents in R. I would suggest using the rsdmx package, which makes reading SDMX documents easier. The package is available on CRAN, and you can also get the latest version on GitHub.

rsdmx allows you to read SDMX documents from a file or a URL, e.g.

require(rsdmx)
sdmx = readSDMX("http://www.bdm.insee.fr/series/sdmx/data/DEFAILLANCES-ENT-FR-ACT/M.AZ+BE.BRUT+CVS-CJO?lastNObservations=2")
as.data.frame(sdmx)

Another approach is to use the web-service interface to the data providers embedded in the package; INSEE is one of them. Try:

sdmx <- readSDMX(providerId = "INSEE", resource = "data",
                 flowRef = "DEFAILLANCES-ENT-FR-ACT",
                 key = "M.AZ+BE.BRUT+CVS-CJO", key.mode = "SDMX",
                 start = 2010, end = 2015)
as.data.frame(sdmx)

AFAIK the package also has issues with character encoding, but I'm currently investigating a solution that should be available in the package soon. Calling getURL(file, .encoding="UTF-8") retrieves the data properly, but the encoding is lost when calling the XML functions.
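In the meantime, one thing that may be worth trying is to declare the encoding explicitly when parsing, using the encoding argument that xmlParse (via xmlTreeParse) accepts, and then pass the parsed document to xmlToList. The sketch below uses a small inline XML string as a hypothetical stand-in for the downloaded response; I have not verified that this resolves the problem for the INSEE URL on a Windows-1252 locale.

```r
library(XML)

# Hypothetical stand-in for the text returned by getURL(); not real INSEE data.
res <- paste0(
  "<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
  "<series><title>D\u00e9faillances</title></series>"
)
res <- enc2utf8(res)  # make sure the string really holds UTF-8 bytes

# Parse the text and state its encoding explicitly instead of relying on the locale
doc <- xmlParse(res, asText = TRUE, encoding = "UTF-8")

# Convert the parsed document (rather than the raw string) to a list
lst <- xmlToList(doc)
title <- lst[["title"]]
print(title)
print(Encoding(title))
```

The same pattern would apply to the real data by replacing the inline string with the result of getURL(url, .encoding = "UTF-8").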

Note: I also see that you use the lastNObservations parameter. For the moment, the web-service interface does not support extra parameters, but support could be added quite easily if you need it.

Upvotes: 2
