Reputation: 418
I am trying to extract text from a Spanish-language source in R, and I am running into a character encoding problem that is not resolved by explicitly specifying the encoding within htmlParse, as recommended here.
library(XML)
library(httr)
url <- "http://www3.hcdn.gov.ar//folio-cgi-bin/om_isapi.dll?E1=&E11=&E12=&E13=&E14=&E15=&E16=&E17=&E18=&E2=&E3=&E5=ley&E6=&E7=&E9=&headingswithhits=on&infobase=proy.nfo&querytemplate=Consulta%20de%20Proyectos%20Parlamentarios&record={4EBB}&recordswithhits=on&softpage=Document42&submit=ejecutar%20"
doc <- htmlParse(rawToChar(GET(url)$content),encoding="windows-1252")
text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
text[77]
The 77th element, which includes an accented i, contains the offending characters. The fourth line of the code includes some additional hoops I have to jump through to read this source. The document itself claims to be encoded in "windows-1252". Specifying "latin1" and the several other encodings I have tried works no better. In my actual application, I have already downloaded many of these files and am reading them locally using readLines, and I can tell that the error is not present after reading the file into R, so the problem must be in htmlParse. Also, simply accepting the encoding error and correcting it after the fact does not seem to be an option, because R does not even recognize the characters it prints if I try to copy and paste them back into a script.
Upvotes: 0
Views: 245
Reputation: 420
Here is a quick fix that may work after you bring the file into R:
Encoding(text) <- "UTF-8"
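Marking the encoding as "UTF-8" makes Spanish text a lot more usable. Note that this only changes the encoding R declares for the strings; it does not convert the underlying bytes.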
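To illustrate the difference between marking and converting, here is a minimal sketch using hand-built bytes rather than the actual page. The word, the byte vectors, and the variable names are assumptions made for illustration, and whether the encoding name "windows-1252" is accepted by iconv depends on the platform (iconvlist() shows what is available).

```r
# The word "politica" with an accented i, as UTF-8 bytes:
# the accented i is the two bytes 0xC3 0xAD.
utf8_bytes <- as.raw(c(0x70, 0x6f, 0x6c, 0xc3, 0xad, 0x74, 0x69, 0x63, 0x61))
# The same word as windows-1252 bytes: the accented i is the single byte 0xED.
cp1252_bytes <- as.raw(c(0x70, 0x6f, 0x6c, 0xed, 0x74, 0x69, 0x63, 0x61))

# Case 1: the bytes are already UTF-8 but the string carries no encoding mark.
marked <- rawToChar(utf8_bytes)
Encoding(marked) <- "UTF-8"   # relabels only; the bytes are untouched

# Case 2: the bytes are windows-1252 and must be converted, not relabelled.
converted <- iconv(rawToChar(cp1252_bytes),
                   from = "windows-1252", to = "UTF-8")

# Both routes end at the same UTF-8 byte sequence.
same_bytes <- identical(charToRaw(marked), charToRaw(converted))
```

If the strings returned by xmlValue are already UTF-8 internally and only lack the mark, the first route (the quick fix above) should be enough. If they still hold windows-1252 bytes, the second route is probably the one to try.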
Upvotes: 1