Reputation: 177
When I read a certain Spanish website, the Spanish accents come back as HTML entities. I read the site with the readLines function (I need to use this function).
url <- "http://www.senamhi.gob.pe/include_mapas/_map_data_hist03.php?drEsta=01"
char_data <- readLines(url,encoding="UTF-8")
After all the processing to extract my data, I have a data frame with a character variable whose values are words with accents. It looks something like this:
var <- rep("Meteorol&oacute;gica",5)
I need to convert the HTML-encoded Spanish accents to normal accented characters. I tried the iconv function:
iconv(var, "UTF-8", "ASCII")
But it doesn't work: I get back the same character vector I passed in. I also tried changing the encoding option of the readLines function, but that doesn't work either.
How can I do this? Thanks.
Upvotes: 2
Views: 992
Reputation: 9582
Why not look up all the HTML &codes; for accented characters, then find and replace them?
library(rvest)
# scrape lookup table of accented char html codes, from the 2nd table on this page
ref_url <- 'http://www.w3schools.com/charsets/ref_html_8859.asp'
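# read_html() is the current name for the older html(), which newer rvest releases no longer provide
char_table <- read_html(ref_url) %>% html_table() %>% `[[`(2)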
# fix names
names(char_table) <- names(char_table) %>% tolower %>% gsub(' ', '_', .)
# here's a test string loaded with different html accents
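test_str <- '&Agrave; &Aacute; &Acirc; &Atilde; &Auml; &Aring; &AElig; &Ccedil; &Egrave; &Eacute; &Ecirc; &Euml; &Igrave; &Iacute; &Icirc; &Iuml; &ETH; &Ntilde; &Ograve; &Oacute; &Ocirc; &Otilde; &Ouml; &times; &Oslash; &Ugrave; &Uacute; &Ucirc; &Uuml; &Yacute; &THORN; &szlig; &agrave; &aacute; &acirc; &atilde; &auml; &aring; &aelig; &ccedil; &egrave; &eacute; &ecirc; &euml; &igrave; &iacute; &icirc; &iuml; &eth; &ntilde; &ograve; &oacute; &ocirc; &otilde; &ouml; &divide; &oslash; &ugrave; &uacute; &ucirc; &uuml; &yacute; &thorn; &yuml;'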
# use mgsub from here (it's just gsub with a for loop)
# http://stackoverflow.com/questions/15253954/replace-multiple-arguments-with-gsub
mgsub(char_table$entity_name, char_table$character, test_str)
And voilà, there you are:
"À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ"
Upvotes: 1
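Since mgsub is not part of base R, here is a minimal sketch of what such a helper might look like in the answer above, assuming a small hand-written lookup table instead of the scraped one. The entity list covers only the characters commonly needed for Spanish and is an assumption for illustration, not the w3schools table. The accented characters are written as \u escapes so the script does not depend on the encoding of the source file.

```r
# Hypothetical helper: apply several literal (non-regex) replacements in turn.
mgsub <- function(pattern, replacement, x) {
  stopifnot(length(pattern) == length(replacement))
  for (i in seq_along(pattern)) {
    # fixed = TRUE keeps the match literal and case-sensitive,
    # so "&Oacute;" and "&oacute;" are handled separately
    x <- gsub(pattern[i], replacement[i], x, fixed = TRUE)
  }
  x
}

# Assumed lookup table: only the entities typically seen in Spanish text
entities <- c("&aacute;", "&eacute;", "&iacute;", "&oacute;", "&uacute;",
              "&ntilde;", "&uuml;",
              "&Aacute;", "&Eacute;", "&Iacute;", "&Oacute;", "&Uacute;",
              "&Ntilde;", "&Uuml;")
chars <- c("\u00e1", "\u00e9", "\u00ed", "\u00f3", "\u00fa",
           "\u00f1", "\u00fc",
           "\u00c1", "\u00c9", "\u00cd", "\u00d3", "\u00da",
           "\u00d1", "\u00dc")

# The vector from the question, with the accent still HTML-encoded
var <- rep("Meteorol&oacute;gica", 5)
decoded <- mgsub(entities, chars, var)
print(decoded)
```

Any entity missing from the table is left untouched, so the table would need extending for other languages.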
Reputation: 1391
I don't know R, but if you can include one line of JavaScript in it, this is the line:
var encoded = 'H&oacute;la';
var notEncoded = encoded.replace("&oacute;", "ó");
Then, get the notEncoded value in your .R program.
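For comparison, the same single replacement can likely be done directly in R with base gsub, so no JavaScript would be needed. This is only a sketch, and the sample string is made up for illustration. One difference worth noting: JavaScript's replace with a string pattern changes only the first match, while gsub changes every match.

```r
# Made-up sample string containing the entity twice
encoded <- "H&oacute;la, Meteorol&oacute;gica"

# fixed = TRUE treats the pattern as a literal string, not a regex;
# "\u00f3" is the accented o, written as an escape to avoid encoding issues
not_encoded <- gsub("&oacute;", "\u00f3", encoded, fixed = TRUE)
print(not_encoded)
```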
Upvotes: 1