pescobar

Reputation: 177

Convert HTML-encoded Spanish accents in R

When I read a particular Spanish website, the accented characters come back as HTML entities. I read the website with the readLines function (I need to use this function).

url <- "http://www.senamhi.gob.pe/include_mapas/_map_data_hist03.php?drEsta=01"
char_data <- readLines(url,encoding="UTF-8")

After all the processing needed to get my data, I have a data frame with a character variable whose values are words with accents. It looks something like this:

var <- rep("Meteorol&oacute;gica",5)

I need to convert the HTML-encoded Spanish accents to normal accented characters. I tried the iconv function:

iconv(var, "UTF-8", "ASCII")

But it doesn't work: I get back the same character vector I passed in. I also tried changing the encoding option in readLines, but that doesn't work either.

How can I do this? Thanks.

Upvotes: 2

Views: 992

Answers (2)

arvi1000

Reputation: 9582

Why not look up all the HTML &codes; for accented characters, then find and replace them?

library(rvest)

# scrape lookup table of accented char html codes, from the 2nd table on this page
ref_url <- 'http://www.w3schools.com/charsets/ref_html_8859.asp'
# read_html() replaces the deprecated rvest::html()
char_table <- read_html(ref_url) %>% html_table %>% `[[`(2)
# fix names
names(char_table) <- names(char_table) %>% tolower %>% gsub(' ', '_', .)

# here's a test string loaded with different html accents
test_str <- '&Agrave; &Aacute; &Acirc; &Atilde; &Auml; &Aring; &AElig; &Ccedil; &Egrave; &Eacute; &Ecirc; &Euml; &Igrave; &Iacute; &Icirc; &Iuml; &ETH; &Ntilde; &Ograve; &Oacute; &Ocirc; &Otilde; &Ouml; &times; &Oslash; &Ugrave; &Uacute; &Ucirc; &Uuml; &Yacute; &THORN; &szlig; &agrave; &aacute; &acirc; &atilde; &auml; &aring; &aelig; &ccedil; &egrave; &eacute; &ecirc; &euml; &igrave; &iacute; &icirc; &iuml; &eth; &ntilde; &ograve; &oacute; &ocirc; &otilde; &ouml; &divide; &oslash; &ugrave; &uacute; &ucirc; &uuml; &yacute; &thorn; &yuml;'

# use mgsub from here (it's just gsub with a for loop)
# http://stackoverflow.com/questions/15253954/replace-multiple-arguments-with-gsub
mgsub(char_table$entity_name, char_table$character, test_str)

And voilà, there you are:

"À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ"
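For readers who cannot reach the lookup page, here is a minimal sketch of what such an mgsub helper might look like, applied to the vector from the question. It is an assumption about the helper's shape (a loop over gsub with literal matching), not necessarily the code from the linked answer. The small `entities` and `characters` vectors are hand-written stand-ins for the scraped `char_table` columns.

```r
# Hypothetical helper: apply several literal replacements one after another.
mgsub <- function(pattern, replacement, x, ...) {
  stopifnot(length(pattern) == length(replacement))
  for (i in seq_along(pattern)) {
    # fixed = TRUE treats each entity as a literal string, not a regex
    x <- gsub(pattern[i], replacement[i], x, fixed = TRUE, ...)
  }
  x
}

# Hand-written lookup covering the lowercase Spanish accents
# (stands in for char_table$entity_name and char_table$character)
entities   <- c("&aacute;", "&eacute;", "&iacute;", "&oacute;", "&uacute;", "&ntilde;")
characters <- c("\u00e1", "\u00e9", "\u00ed", "\u00f3", "\u00fa", "\u00f1")

var <- rep("Meteorol&oacute;gica", 5)
decoded <- mgsub(entities, characters, var)
print(decoded)
```

Writing the replacement characters as `\u` escapes keeps the script independent of the encoding of the source file.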

Upvotes: 1

jack

Reputation: 1391

I don't know R, but if you can include a line of JavaScript in it, this is the line:

var encoded = 'H&oacute;la';
var notEncoded = encoded.replace("&oacute;", "ó"); // &oacute; is o with an acute accent

Then read the notEncoded value back in your R program.
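The same single-entity replacement can probably be done without leaving R. A minimal sketch, assuming only the `&oacute;` entity needs decoding and reusing the `var` vector from the question:

```r
var <- rep("Meteorol&oacute;gica", 5)

# gsub() replaces every occurrence in every element;
# fixed = TRUE matches the entity as a literal string.
not_encoded <- gsub("&oacute;", "\u00f3", var, fixed = TRUE)
print(not_encoded)
```

This handles one entity only; other accents would each need their own replacement.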

Upvotes: 1
