Reputation: 1893
This question is related to this previous one on how to replace accented strings like México with the equivalent LaTeX code M\'{e}xico.
My problem here is slightly different. I am using a third party database with string variables with Spanish accents like above. However, the encoding appears odd since this is the behavior I get:
> grep("México",temp$dest_nom_ent)
integer(0)
> grep("Mexico",temp$dest_nom_ent)
integer(0)
> grep("xico",temp$dest_nom_ent)
[1] 18 19 20
> temp$dest_nom_ent[grep("xico",temp$dest_nom_ent)]
[1] "México" "México" "México"
where temp$dest_nom_ent is a variable with the names of Mexican states.
My question, then, is how to convert the string variable from the third party database into an encoding that standard R functions will recognize. Please note:
> Encoding(temp$dest_nom_ent)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[8] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[15] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[22] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[29] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[36] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[43] "unknown" "unknown"
For further info I am using Windows 7 64. Also note:
> charToRaw(temp$dest_nom_ent[18])
[1] 4d e9 78 69 63 6f
According to this source, these bytes coincide with the Windows Spanish (Traditional Sort) locale:
M=4d
é=e9
x=78
i=69
c=63
o=6f
And also note:
> charToRaw("México")
[1] 4d c3 a9 78 69 63 6f
> Encoding("México")
[1] "latin1"
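The two byte dumps above can be compared directly. The sketch below assumes the database bytes are latin1/CP1252 (as the byte table suggests) and that the typed literal reached R as UTF-8; the variable names are only illustrative.

```r
# Bytes reported by charToRaw() for the database value and for the typed literal
from_db <- as.raw(c(0x4d, 0xe9, 0x78, 0x69, 0x63, 0x6f))
typed   <- as.raw(c(0x4d, 0xc3, 0xa9, 0x78, 0x69, 0x63, 0x6f))

# The byte sequences differ, so a byte-wise match on the accent cannot succeed
same_bytes <- identical(from_db, typed)

# Assumption: the database value is latin1, where each byte is its own code point
cp_db <- as.integer(from_db)

# Assumption: the typed literal is UTF-8; decode it to code points
typed_chr <- rawToChar(typed)
Encoding(typed_chr) <- "UTF-8"
cp_typed <- utf8ToInt(typed_chr)

# Same characters, two different encodings
same_chars <- identical(cp_db, cp_typed)
print(c(same_bytes = same_bytes, same_chars = same_chars))
```

If both assumptions hold, the data and the pattern contain the same characters but in different encodings, which would explain why "xico" matches and "México" does not.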
I have tried the following unsuccessfully (meaning that grep("é",temp$dest_nom_ent) still returns an empty vector):
Encoding(temp$dest_nom_ent)<-"latin1"
temp$dest_nom_ent <- iconv(temp$dest_nom_ent,"","latin1")
temp$dest_nom_ent <- enc2utf8(temp$dest_nom_ent)
...
I checked the supported character sets using iconvlist(), and "WINDOWS-1252" is supported. The following, however, did not work:
> temp1 <- temp$dest_nom_ent[grep("xico",temp$dest_nom_ent)]
> temp1
[1] "México" "México" "México"
> Encoding(temp1)<-"WINDOWS-1252"
> temp1 <- iconv(temp1,"WINDOWS-1252","latin1")
> temp1
[1] "México" "México" "México"
> Encoding(temp1)
[1] "latin1" "latin1" "latin1"
> charToRaw(temp1[1])
[1] 4d e9 78 69 63 6f
> grep("é",temp1)
integer(0)
which compares to:
> temp2 <- c("México","México","México")
> temp2
[1] "México" "México" "México"
> Encoding(temp2)
[1] "latin1" "latin1" "latin1"
> charToRaw(temp2[1])
[1] 4d c3 a9 78 69 63 6f
> grep("é",temp2)
[1] 1 2 3
Tried to find out the encoding by brute force like:
try(for(i in 1:length(iconvlist())){
temp1 <- temp$dest_nom_ent[grep("xico",temp$dest_nom_ent)]
Encoding(temp1)<-iconvlist()[i]
temp1 <- iconv(temp1,iconvlist()[i],"latin1")
print(grep("é",temp1))
print(i)
},silent=FALSE)
I am not familiar with the try function, but it still exits at the first error instead of ignoring it, so I cannot check the whole list:
...
[1] 17
integer(0)
[1] 18
integer(0)
[1] 19
integer(0)
[1] 20
Error in iconv(temp1, iconvlist()[i], "latin1") :
unsupported conversion from 'CP-GR' to 'latin1' in codepage 1252
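The loop stops because try() wraps the whole for loop: the first error ends the wrapped expression, and the remaining iterations never run. Putting the handler inside the loop body lets each candidate fail on its own. The following is a minimal sketch, not the original code: the short `candidates` vector (including a deliberately invalid name) stands in for iconvlist(), and the test string is rebuilt from the bytes shown earlier.

```r
# Six bytes as reported by charToRaw() for the database value
x <- rawToChar(as.raw(c(0x4d, 0xe9, 0x78, 0x69, 0x63, 0x6f)))
# The UTF-8 bytes we hope a correct conversion will produce
target <- as.raw(c(0x4d, 0xc3, 0xa9, 0x78, 0x69, 0x63, 0x6f))

# Hypothetical short list; in practice this could be iconvlist()
candidates <- c("latin1", "CP1252", "not-a-real-encoding", "UTF-16LE")

hits <- character(0)
for (enc in candidates) {
  # Handle the error per iteration so one bad encoding does not end the loop
  converted <- tryCatch(
    iconv(x, from = enc, to = "UTF-8"),
    error = function(e) NA_character_
  )
  if (!is.na(converted) && identical(charToRaw(converted), target)) {
    hits <- c(hits, enc)
  }
}
print(hits)
```

Comparing raw bytes instead of calling grep("é", ...) keeps the check independent of how the accented literal was typed into the console.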
Finally:
> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> d<-c("México","México")
> for(i in 1:7){d1 <- str_sub(d[1],i,i); print(d1)}
[1] "M"
[1] "Ã"
[1] "©"
[1] "x"
[1] "i"
[1] "c"
[1] "o"
> print(grep("é",d))
[1] 1 2
So it seems I will have to change the computer's locale as suggested here. Also see here
PS: In case you wonder how I managed to type d<-c("México","México") with an English_United States.1252 locale: I set up a secondary Spanish (Traditional Sort) keyboard using Control Panel > Clock, Language and Region > Region and Language > Keyboards and Languages > Change Keyboards. Under installed services, click add and navigate to Spanish traditional sort. Then, under advanced key settings, you can create a shortcut to switch keyboards; in my case Shift+Alt. So if I want to type ñ under the default English locale, I press Shift+Alt, followed by ;, and then Shift+Alt again to go back to the English keyboard.
Upvotes: 4
Views: 697
Reputation: 1893
Well, I could not determine the encoding of the accents, but the following accomplishes what I wanted. The trick was to convert to UTF-8, set the sub() option useBytes=TRUE, and follow Joran's suggestion to use sanitize.text.function=function(x){x} for xtable(). Here is the sample code; it is easy to loop over all accented vowels:
> temp1 <- unique(temp$dest_nom_ent)
> temp1
[1] "Aguascalientes" "Baja California"
[3] "Baja California Sur" "Campeche"
[5] "Coahuila de Zaragoza" "Colima"
[7] "Chiapas" "Guanajuato"
[9] "Guerrero" "Hidalgo"
[11] "Jalisco" "México"
[13] "Michoacán de Ocampo" "Morelos"
[15] "Nayarit" "Oaxaca"
[17] "Puebla" "Querétaro"
[19] "Quintana Roo" "San Luis Potosí"
[21] "Sinaloa" "Tabasco"
[23] "Tlaxcala" "Veracruz de Ignacio de la Llave"
[25] "Zacatecas"
> temp1 <- iconv(unique(temp1),"","UTF-8")
> temp1
[1] "Aguascalientes" "Baja California"
[3] "Baja California Sur" "Campeche"
[5] "Coahuila de Zaragoza" "Colima"
[7] "Chiapas" "Guanajuato"
[9] "Guerrero" "Hidalgo"
[11] "Jalisco" "México"
[13] "Michoacán de Ocampo" "Morelos"
[15] "Nayarit" "Oaxaca"
[17] "Puebla" "Querétaro"
[19] "Quintana Roo" "San Luis Potosí"
[21] "Sinaloa" "Tabasco"
[23] "Tlaxcala" "Veracruz de Ignacio de la Llave"
[25] "Zacatecas"
> Encoding(temp1)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[8] "unknown" "unknown" "unknown" "unknown" "UTF-8" "UTF-8" "unknown"
[15] "unknown" "unknown" "unknown" "UTF-8" "unknown" "UTF-8" "unknown"
[22] "unknown" "unknown" "unknown" "unknown"
> temp2 <- sub("é", "\\\\'{e}", temp1, useBytes = TRUE)
> temp2 <- data.frame(temp2)
> print(xtable(temp2),sanitize.text.function=function(x){x})
% latex table generated in R 2.13.1 by xtable 1.5-6 package
% Fri Jul 15 13:52:44 2011
\begin{table}[ht]
\begin{center}
\begin{tabular}{rl}
\hline
& temp2 \\
\hline
1 & Aguascalientes \\
2 & Baja California \\
3 & Baja California Sur \\
4 & Campeche \\
5 & Coahuila de Zaragoza \\
6 & Colima \\
7 & Chiapas \\
8 & Guanajuato \\
9 & Guerrero \\
10 & Hidalgo \\
11 & Jalisco \\
12 & M\'{e}xico \\
13 & Michoacán de Ocampo \\
14 & Morelos \\
15 & Nayarit \\
16 & Oaxaca \\
17 & Puebla \\
18 & Quer\'{e}taro \\
19 & Quintana Roo \\
20 & San Luis Potosí \\
21 & Sinaloa \\
22 & Tabasco \\
23 & Tlaxcala \\
24 & Veracruz de Ignacio de la Llave \\
25 & Zacatecas \\
\hline
\end{tabular}
\end{center}
\end{table}
As actually implemented in a loop:
temp$dest_nom_ent <- iconv(
temp$dest_nom_ent,"","UTF-8")
temp$dest_nom_mun <- iconv(
temp$dest_nom_mun,"","UTF-8")
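accents <- c("á","é","í","ó","ú")
latex <- c("\\\\'{a}","\\\\'{e}","\\\\'{i}","\\\\'{o}","\\\\'{u}")
for(i in seq_along(accents)){
  # note: sub() replaces only the first match of each accent per string
  temp$dest_nom_ent <- sub(accents[i], latex[i],
                           temp$dest_nom_ent, useBytes = TRUE)
  # operate on dest_nom_mun itself, not on dest_nom_ent
  temp$dest_nom_mun <- sub(accents[i], latex[i],
                           temp$dest_nom_mun, useBytes = TRUE)
}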
capture.output(
print(xtable(temp),sanitize.text.function=function(x){x}),
file = "../paper/rTables.tex", append = FALSE)
Still, the answer is incomplete in that I cannot explain what exactly was going on. Found it through trial and error.
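One possible reading of why this works, offered as a guess that the thread does not confirm: the data held the single latin1/CP1252 byte e9, the typed "é" arrived as the two UTF-8 bytes c3 a9, and after converting the data to UTF-8 a byte-wise sub() finds the same two bytes. The sketch below reproduces that under those assumptions. It names "latin1" explicitly instead of the native encoding "", and builds the pattern from raw bytes so the result does not depend on the console.

```r
# Rebuild the database value from the bytes shown in the question
x <- rawToChar(as.raw(c(0x4d, 0xe9, 0x78, 0x69, 0x63, 0x6f)))

# Assumption: the source encoding is latin1; convert to UTF-8
y <- iconv(x, from = "latin1", to = "UTF-8")
print(charToRaw(y))

# e-acute as its two UTF-8 bytes (c3 a9)
e_acute <- rawToChar(as.raw(c(0xc3, 0xa9)))
Encoding(e_acute) <- "UTF-8"

# Byte-wise replacement, as in the answer; "\\\\" yields one literal backslash
out <- sub(e_acute, "\\\\'{e}", y, useBytes = TRUE)
print(out)
```

If the guess is right, useBytes=TRUE matters because it makes sub() ignore the (possibly wrong) encoding mark on the typed pattern and compare bytes only.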
Upvotes: 0
Reputation: 263471
Try setting the encoding of the string to one of "ISO_8859-1" or "ISO_8859-15".
Two more suggestions, then I give up: "UTF-16" and "UTF-16LE". The second is UTF-16 little-endian, I believe, and I have read that it is what Windows 7 actually uses. You might as well try "UTF-16BE" too. (Material gathered from another Stack Exchange post: https://superuser.com/questions/221593/windows-7-utf-8-and-unicode )
Upvotes: 0
Reputation: 121157
Take a look at what the encodings of temp$dest_nom_ent and "México" are, using Encoding(x). You may need to convert with enc2native or enc2utf8.
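A small sketch of that check, using a string rebuilt from the bytes in the question in place of the real column. It assumes the bytes are latin1; enc2utf8() can only convert correctly once the string carries the right declared encoding.

```r
# Stand-in for one element of temp$dest_nom_ent (bytes taken from the question)
x <- rawToChar(as.raw(c(0x4d, 0xe9, 0x78, 0x69, 0x63, 0x6f)))
enc_before <- Encoding(x)   # no declared encoding yet

# Assumption: the bytes are latin1, so declare that before converting
Encoding(x) <- "latin1"
u <- enc2utf8(x)

enc_after <- Encoding(u)
bytes_after <- charToRaw(u)
print(enc_before)
print(enc_after)
print(bytes_after)
```

Note that Encoding<- only changes the mark on the string and leaves the bytes alone, while enc2utf8() and iconv() actually re-encode them.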
Upvotes: 1