Reputation: 2652
I downloaded a web page with a list of Brazilian cities. The vector of strings comes out as follows:
vector_cities = strsplit("Nova Lima,São Paulo,Contagem,Rio de Janeiro,Rio de Janeiro,São Paulo,Castanhal,Diadema,Rio de Janeiro,Rio Verde,Porto Alegre,Maurilândia,Samambaia,Rio de Janeiro,Passo Fundo,São Paulo,Casimiro de Abreu,Rio de Janeiro,Barueri,Santos,São Paulo,São Paulo,Goiânia,Pelotas,Rio de Janeiro", ",")
vector_cities
[1] "Nova Lima" "São Paulo" "Contagem" "Rio de Janeiro" "Rio de Janeiro"
[6] "São Paulo" "Castanhal" "Diadema" "Rio de Janeiro" "Rio Verde"
[11] "Porto Alegre" "Maurilândia" "Samambaia" "Rio de Janeiro" "Passo Fundo"
[16] "São Paulo" "Casimiro de Abreu" "Rio de Janeiro" "Barueri" "Santos"
[21] "São Paulo" "São Paulo" "Goiânia" "Pelotas" "Rio de Janeiro"
I understand how the special characters above are coded, since this is the default encoding for HTML. However, I have tried many permutations of
iconv(vector_cities, from = "anything", to = "anything")
and none of them turned S(code)o into São or Sao, for example. Calling Encoding(vector_cities)
results in the following:
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[11] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[21] "unknown" "unknown" "unknown" "unknown" "unknown"
What am I missing? Do I have to change something in the strings to get the right encoding?
Upvotes: 1
Views: 2108
Reputation: 11128
You can do the following. I used a custom function to convert the HTML character references (the &#...; codes) to their Unicode equivalents; stri_trans_general from stringi then transliterates the resulting Unicode characters to plain English (ASCII) letters. The XML-based parser function is taken from another answer here on SO.
library(stringi)
vector_cities = strsplit("Nova Lima,São Paulo,Contagem,Rio de Janeiro,Rio de Janeiro,São Paulo,Castanhal,Diadema,Rio de Janeiro,Rio Verde,Porto Alegre,Maurilândia,Samambaia,Rio de Janeiro,Passo Fundo,São Paulo,Casimiro de Abreu,Rio de Janeiro,Barueri,Santos,São Paulo,São Paulo,Goiânia,Pelotas,Rio de Janeiro", ",")
vector_cities <- vector_cities[[1]]
library(XML)
html_txt <- function(str) {
  xpathApply(htmlParse(str, asText = TRUE),
             "//body//text()",
             xmlValue)[[1]]
}
## html_txt decodes the HTML character references (such as the one for ã) into their UTF-8 characters, which stringi can then transliterate to plain English (ASCII) letters
x <- vector_cities
txt <- html_txt(x)
Encoding(txt) <- "UTF-8" # mark the result as UTF-8; this step is optional and can be skipped
splt_txt <- strsplit(txt, split = "\n")[[1]]
stringi::stri_trans_general(splt_txt, "latin-ascii")
Output:
[1] "Nova Lima" "Sao Paulo"
[3] "Contagem" "Rio de Janeiro"
[5] "Rio de Janeiro" "Sao Paulo"
[7] "Castanhal" "Diadema"
[9] "Rio de Janeiro" "Rio Verde"
[11] "Porto Alegre" "Maurilandia"
[13] "Samambaia" "Rio de Janeiro"
[15] "Passo Fundo" "Sao Paulo"
[17] "Casimiro de Abreu" "Rio de Janeiro"
[19] "Barueri" "Santos"
[21] "Sao Paulo" "Sao Paulo"
[23] "Goiania" "Pelotas"
[25] "Rio de Janeiro"
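As a complement to the XML-based approach above, here is a minimal base-R sketch of the same decoding step. It assumes the scraped strings contain decimal numeric character references such as &#227; (an assumption; the exact form of the codes is not shown in the question), and decode_numeric_entities is a hypothetical helper name, not part of any package. It also illustrates why Encoding() reports "unknown" and why iconv() has no visible effect: the raw strings are pure ASCII, so there is nothing for a re-encoding to change.

```r
# Hypothetical helper (not from any package): replaces decimal numeric
# character references like "&#227;" with the corresponding UTF-8 character.
# Assumption: only decimal references occur; named entities (e.g. "&atilde;")
# and hexadecimal references are left untouched.
decode_numeric_entities <- function(x) {
  vapply(x, function(s) {
    m <- gregexpr("&#[0-9]+;", s)
    refs <- regmatches(s, m)[[1]]
    if (length(refs) > 0) {
      # Keep only the digits of each reference and convert them to code points
      codes <- as.integer(gsub("[^0-9]", "", refs))
      regmatches(s, m) <- list(intToUtf8(codes, multiple = TRUE))
    }
    s
  }, character(1), USE.NAMES = FALSE)
}

# Assumed sample input, written with numeric references
x <- c("S&#227;o Paulo", "Goi&#226;nia", "Santos")

# Pure ASCII strings are always reported as "unknown"
print(Encoding(x))

# Re-encoding ASCII-only text leaves it unchanged
print(identical(iconv(x, from = "latin1", to = "UTF-8"), x))

decoded <- decode_numeric_entities(x)
print(decoded)

# Optional: transliterate to ASCII with stringi, if it is installed
if (requireNamespace("stringi", quietly = TRUE)) {
  print(stringi::stri_trans_general(decoded, "Latin-ASCII"))
}
```

If the page uses named entities rather than numeric ones, this sketch will not decode them, and an HTML parser such as the one used in html_txt above is the safer choice.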
Upvotes: 2