Felipe Alvarenga

Reputation: 2652

Convert default html encoding to UTF-8 or latin1 in R

I downloaded a web page with a list of Brazilian cities. The vector of strings comes as follows:

vector_cities = strsplit("Nova Lima,S&#227;o Paulo,Contagem,Rio de Janeiro,Rio de Janeiro,S&#227;o Paulo,Castanhal,Diadema,Rio de Janeiro,Rio Verde,Porto Alegre,Mauril&#226;ndia,Samambaia,Rio de Janeiro,Passo Fundo,S&#227;o Paulo,Casimiro de Abreu,Rio de Janeiro,Barueri,Santos,S&#227;o Paulo,S&#227;o Paulo,Goi&#226;nia,Pelotas,Rio de Janeiro", ",")

vector_cities
 [1] "Nova Lima"         "S&#227;o Paulo"    "Contagem"          "Rio de Janeiro"    "Rio de Janeiro"   
 [6] "S&#227;o Paulo"    "Castanhal"         "Diadema"           "Rio de Janeiro"    "Rio Verde"        
[11] "Porto Alegre"      "Mauril&#226;ndia"  "Samambaia"         "Rio de Janeiro"    "Passo Fundo"      
[16] "S&#227;o Paulo"    "Casimiro de Abreu" "Rio de Janeiro"    "Barueri"           "Santos"           
[21] "S&#227;o Paulo"    "S&#227;o Paulo"    "Goi&#226;nia"      "Pelotas"           "Rio de Janeiro" 

I understand that the special characters above are written as HTML numeric character references, which is how the page encodes them. However, I have tried many permutations of

iconv(vector_cities, from = "anything", to = "anything")

and none of them turned S&#227;o into São or Sao, for example. Calling Encoding(vector_cities) results in the following:

 [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[11] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[21] "unknown" "unknown" "unknown" "unknown" "unknown"

What am I missing? Do I have to change something in the strings to get the right encoding?

Upvotes: 1

Views: 2108

Answers (1)

PKumar

Reputation: 11128

You can do the following. I used a small custom function, built on the XML package, to convert the HTML numeric character references (&#...;) to their Unicode equivalents, and then stri_trans_general from stringi to transliterate the resulting accented characters into plain ASCII letters. I took the XML parsing function from this link on SO itself.

library(stringi)
vector_cities = strsplit("Nova Lima,S&#227;o Paulo,Contagem,Rio de Janeiro,Rio de Janeiro,S&#227;o Paulo,Castanhal,Diadema,Rio de Janeiro,Rio Verde,Porto Alegre,Mauril&#226;ndia,Samambaia,Rio de Janeiro,Passo Fundo,S&#227;o Paulo,Casimiro de Abreu,Rio de Janeiro,Barueri,Santos,S&#227;o Paulo,S&#227;o Paulo,Goi&#226;nia,Pelotas,Rio de Janeiro", ",")

vector_cities <- vector_cities[[1]]

library(XML)

html_txt <- function(str) {
  xpathApply(htmlParse(str, asText=TRUE),
             "//body//text()", 
             xmlValue)[[1]] 
}

## html_txt converts numeric character references such as &#227; to the corresponding UTF-8 characters, which stringi functions can then transliterate to ASCII letters

x <- vector_cities 
txt <- html_txt(x)  # all cities come back as one newline-separated string
Encoding(txt) <- "UTF-8"  # optional: marks the result as UTF-8
splt_txt <- strsplit(txt, split = "\n")[[1]]
stringi::stri_trans_general(splt_txt, "latin-ascii")

Output:

 [1] "Nova Lima"         "Sao Paulo"        
 [3] "Contagem"          "Rio de Janeiro"   
 [5] "Rio de Janeiro"    "Sao Paulo"        
 [7] "Castanhal"         "Diadema"          
 [9] "Rio de Janeiro"    "Rio Verde"        
[11] "Porto Alegre"      "Maurilandia"      
[13] "Samambaia"         "Rio de Janeiro"   
[15] "Passo Fundo"       "Sao Paulo"        
[17] "Casimiro de Abreu" "Rio de Janeiro"   
[19] "Barueri"           "Santos"           
[21] "Sao Paulo"         "Sao Paulo"        
[23] "Goiania"           "Pelotas"          
[25] "Rio de Janeiro"   
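
If you would rather avoid the XML dependency, here is a minimal sketch in base R. It assumes the strings only contain decimal references such as &#227;. Hexadecimal references (&#xE3;) and named entities (&atilde;) are not handled. The function name decode_numeric_refs is made up for illustration and is not part of any package.

```r
# Hypothetical helper: replace decimal numeric character references
# (e.g. "&#227;") with the characters they stand for, using base R only.
decode_numeric_refs <- function(x) {
  m <- gregexpr("&#[0-9]+;", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(refs) {
    # Keep only the digits of each reference, then map code point -> character
    codes <- as.integer(gsub("[^0-9]", "", refs))
    vapply(codes, intToUtf8, character(1))
  })
  x
}

cities <- c("S&#227;o Paulo", "Contagem", "Goi&#226;nia")
decoded <- decode_numeric_refs(cities)
print(decoded)
```

The result keeps the accents. If you want plain ASCII instead, you can pass decoded to stringi::stri_trans_general(decoded, "latin-ascii") as above.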

Upvotes: 2
