Mabyn

Reputation: 317

Non-ASCII characters in R, reading from .sav file

I am trying to read a .sav file into RStudio. The file contains data from a Spanish language survey, and when I read it into R -- even though my default text encoding has already been set to ISO-8859-1 -- the display of special characters is incorrect.

For example, the word "Camión" appears as

"Cami<c3><b3>n" 

even though it shows up correctly as "Camión" in PSPP.

This is what I did:

install.packages("memisc")
library(memisc)
jcv2014 <- as.data.set(spss.system.file('myfile.sav'))

Later, I wanted to create a list of just the variable labels, so I did the following:

library(foreign)
jcv2014.spss <- read.spss("myfile.sav", to.data.frame=FALSE, use.value.labels=FALSE)
jcv2014_vars <- attr(jcv2014.spss, "variable.labels")

(I'm not sure if this is the best way to do it, but it worked.)

Anyway, this time around, I still didn't get the proper accents but there was a different sort of encoding:

A variable label that was supposed to be "¿Qué calificación le daría..." instead appeared as

"\302\277Qu\303\251 calificaci\303\263n le dar\303\255a..."

I'm not sure how to get the proper characters, but they appear correctly in PSPP. I tried changing the default text encoding in R to both ISO-8859-1 and UTF-8, to no avail. I don't know what the original file was encoded in, but I guessed it would be one of those.

Any ideas?

And if it helps, I have R version 3.1.1 and OS X Yosemite version 10.10.1, and I am using PSPP, not SPSS.

Thanks so much in advance!!!

Upvotes: 4

Views: 1514

Answers (1)

arvi1000

Reputation: 9582

Can you just set the encoding once you've read the data in?

# Here's your sentence
s <- "\302\277Qu\303\251 calificaci\303\263n le dar\303\255a..."

# it has no declared encoding (R reports it as "unknown")
Encoding(s)
# [1] "unknown"

# but if you tell iconv() the bytes are UTF-8, then it shows up correctly
iconv(s, from = 'UTF-8')
# [1] "¿Qué calificación le daría..."

# This also works
Encoding(s) <- 'UTF-8'
s
# [1] "¿Qué calificación le daría..."
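
If what you need to fix is the full set of variable labels rather than one sentence, the same idea should carry over to a whole character vector. Here is a minimal sketch, assuming the labels really are UTF-8 bytes. `labels` is a made-up stand-in for your attr(jcv2014.spss, "variable.labels"), and `mark_utf8` is just a helper name I picked, not something from a package.

```r
# Hypothetical stand-in for attr(jcv2014.spss, "variable.labels"):
# a character vector holding raw UTF-8 bytes.
labels <- c("\302\277Qu\303\251 calificaci\303\263n le dar\303\255a...",
            "Cami\303\263n")

# Helper (name is an assumption): declare every element as UTF-8.
# This only sets the encoding mark; the bytes themselves are not changed.
mark_utf8 <- function(x) {
  Encoding(x) <- "UTF-8"
  x
}

labels_fixed <- mark_utf8(labels)

# Every non-ASCII element should now report "UTF-8"
Encoding(labels_fixed)
```

Since this only marks the strings and does not convert them, it will only help if the file was in fact stored as UTF-8, which the \303\263 byte pairs in your output suggest.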

Here are the results of my sessionInfo() call. You should post yours too.

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reshape2_1.4     hexbin_1.27.0    ggplot2_1.0.0    data.table_1.9.2 yaml_2.1.13     
[6] redshift_0.4     RJDBC_0.2-4      rJava_0.9-6      DBI_0.3.1       

loaded via a namespace (and not attached):
 [1] colorspace_1.2-4 digest_0.6.4     grid_3.1.1       gtable_0.1.2     labeling_0.2    
 [6] lattice_0.20-29  MASS_7.3-33      munsell_0.4.2    plyr_1.8.1       proto_0.3-10    
[11] Rcpp_0.11.2      scales_0.2.4     stringr_0.6.2    tools_3.1.1  

Update: It looks like you may not have a locale that supports UTF-8. Here are the locale settings for each category on my system. You might try using Sys.setlocale() to update them one by one on your system (or just use LC_ALL if you don't feel the need to test each one incrementally). See ?Sys.setlocale for more info.

cat_str <- c("LC_COLLATE", "LC_CTYPE", "LC_MONETARY", "LC_NUMERIC",
             "LC_TIME", "LC_MESSAGES", "LC_PAPER", "LC_MEASUREMENT")
sapply(cat_str, Sys.getlocale)

# LC_COLLATE       LC_CTYPE    LC_MONETARY     LC_NUMERIC        LC_TIME    LC_MESSAGES 
# "en_US.UTF-8"  "en_US.UTF-8"  "en_US.UTF-8"            "C"  "en_US.UTF-8"  "en_US.UTF-8" 
# LC_PAPER LC_MEASUREMENT 
# ""             "" 
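
For example, something along these lines switches only the character-type category and then checks whether R considers the session to be UTF-8. It is a sketch, not a guaranteed fix: "en_US.UTF-8" is the locale name from my system and may not exist on yours, in which case Sys.setlocale() returns an empty string.

```r
# Remember the current setting so it can be restored later if needed
old_ctype <- Sys.getlocale("LC_CTYPE")

# Try to switch character handling to a UTF-8 locale.
# "en_US.UTF-8" is an assumption taken from my own machine.
new_ctype <- suppressWarnings(Sys.setlocale("LC_CTYPE", "en_US.UTF-8"))

if (identical(new_ctype, "")) {
  message("Locale 'en_US.UTF-8' is not available; LC_CTYPE left as: ", old_ctype)
}

# TRUE if R now treats the session's native encoding as UTF-8
is_utf8 <- l10n_info()[["UTF-8"]]
is_utf8
```

If `is_utf8` comes back TRUE, try printing the labels again. If not, try a different locale name with a UTF-8 suffix.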

Upvotes: 2
