Reputation: 1253
I am retrieving online XML data using the XML R package. My issue is that the UTF-8 encoding is lost during the call to xmlToList: for instance, 'é' is replaced by 'Ã©'. This happens during the XML parsing.
Here is a code snippet, with one example where the encoding is lost and another where it is kept (depending on the data source):
library(XML)
library(RCurl)
url = "http://www.bdm.insee.fr/series/sdmx/data/DEFAILLANCES-ENT-FR-ACT/M.AZ+BE.BRUT+CVS-CJO?lastNObservations=2"
res <- getURL(url)
xmlToList(res)
# encoding lost
url2 = "http://www.bdm.insee.fr/series/sdmx/conceptscheme/"
res2 <- getURL(url2)
xmlToList(res2)
# encoding kept
Why does the encoding behaviour differ between the two sources? I tried setting .encoding = "UTF-8" in getURL, and calling enc2utf8(res), but neither makes any difference.
Any help is welcome!
Thanks,
Jérémy
R version 3.2.1 (2015-06-18)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252
[3] LC_MONETARY=French_France.1252 LC_NUMERIC=C
[5] LC_TIME=French_France.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RCurl_1.95-4.7 bitops_1.0-6 XML_3.98-1.3
loaded via a namespace (and not attached):
[1] tools_3.2.1
Upvotes: 2
Views: 623
Reputation: 603
You are trying to read SDMX documents in R. I would suggest using the rsdmx package, which makes reading SDMX documents easier. The package is available on CRAN, and the latest version is on GitHub.
rsdmx allows you to read SDMX documents from a file or a URL, e.g.
require(rsdmx)
sdmx = readSDMX("http://www.bdm.insee.fr/series/sdmx/data/DEFAILLANCES-ENT-FR-ACT/M.AZ+BE.BRUT+CVS-CJO?lastNObservations=2")
as.data.frame(sdmx)
Another approach is to use the web-service interface to the embedded data providers, of which INSEE is one. Try:
sdmx <- readSDMX(providerId = "INSEE", resource = "data",
                 flowRef = "DEFAILLANCES-ENT-FR-ACT",
                 key = "M.AZ+BE.BRUT+CVS-CJO", key.mode = "SDMX",
                 start = 2010, end = 2015)
as.data.frame(sdmx)
As far as I know, the package also has character encoding issues, but I'm currently investigating a fix that should be available in the package soon. Calling getURL(file, .encoding="UTF-8") retrieves the data properly, but the encoding is lost when calling the XML functions.
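In the meantime, one thing that may be worth trying is to state the encoding at the parsing step rather than at the download step. The following is only a minimal sketch, assuming the encoding argument of xmlParse applies to this case; the inline XML string and the names xml_text, doc and parsed are made up for illustration, and the sketch is untested against the INSEE service:

```r
library(XML)

# Hypothetical stand-in for the text returned by getURL();
# "\u00e9" is the character 'é', stored as UTF-8.
xml_text <- "<root><label>D\u00e9faillances</label></root>"

# Parse from a string and tell the parser explicitly that the input is UTF-8.
doc <- xmlParse(xml_text, asText = TRUE, encoding = "UTF-8")

# Convert the parsed document to an R list, as in the question.
parsed <- xmlToList(doc)

# Inspect the value and the encoding mark R attached to it.
print(parsed$label)
print(Encoding(parsed$label))
```

With the real data this would amount to xmlToList(xmlParse(res, asText = TRUE, encoding = "UTF-8")), where res is the result of getURL. Whether this restores the accents on your Windows locale is something to check on your side.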
Note: I also see you use the lastNObservations parameter. For the moment the web-service interface does not support extra parameters, but this could be added quite easily if you need it.
Upvotes: 2