Mace

Reputation: 1269

Character encoding in R

I am trying to read a CSV file generated by SQL Server Management Studio and encoded as UTF-8 (I chose that option when saving it) into R version 3.0.1 (x64) with read.csv2(). I can't get R to display special characters correctly.

If I set fileEncoding="UTF-8-BOM", the import stops at the line that contains a ÿ. However, when I open the file in Notepad++, the ÿ is displayed correctly with UTF-8 encoding. I have also tried without setting fileEncoding, but then the special characters aren't displayed correctly (of course).

The CSV file is available here: https://www.dropbox.com/s/7y47i826ikq8ahi/Data.csv

How do I read the csv file and display the text in the right encoding?

Thanks!!

Upvotes: 4

Views: 20657

Answers (3)

Emeeus

Reputation: 5250

In my case, I had this issue with R inside a Docker container (Debian and R). When I ran locale in the container, all the variables appeared empty. I solved the problem by adding this to the Dockerfile:

ENV LANG=en_US.UTF-8
ENV LC_CTYPE=en_US.UTF-8
ENV LC_NUMERIC=es_AR.UTF-8
ENV LC_TIME=es_AR.UTF-8
ENV LC_COLLATE=en_US.UTF-8
ENV LC_MONETARY=es_AR.UTF-8
ENV LC_MESSAGES=en_US.UTF-8
ENV LC_PAPER=es_AR.UTF-8
ENV LC_NAME=es_AR.UTF-8
ENV LC_ADDRESS=es_AR.UTF-8
ENV LC_TELEPHONE=es_AR.UTF-8
ENV LC_MEASUREMENT=es_AR.UTF-8
ENV LC_IDENTIFICATION=es_AR.UTF-8
ENV LC_ALL=C.UTF-8

I use es_AR for some of the values, but I think en_US or another locale should work too.
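
To check from inside R whether the container's locale settings took effect, a sketch like the following may help. It only uses base R, and it assumes a UTF-8 locale is what you want in the image. Note that on POSIX systems LC_ALL takes precedence over the individual LC_* variables, so with LC_ALL=C.UTF-8 set as above, the other values are probably overridden.

```r
# Report the character-type locale that R is actually running under.
ctype <- Sys.getlocale("LC_CTYPE")
print(ctype)

# l10n_info() reports whether the current locale is a UTF-8 locale.
is_utf8 <- l10n_info()[["UTF-8"]]
print(is_utf8)

# Assumption: a UTF-8 locale is the goal; warn if R does not see one.
if (!isTRUE(is_utf8)) {
  message("R is not running in a UTF-8 locale; check LANG / LC_ALL in the Dockerfile.")
}
```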

Upvotes: 0

David

Reputation: 10152

For those who are still stuck with this issue: my scripts were able to recognise umlauts (ä, ö, ü) and ß after I added a line at the top of the script that changes the default character encoding: options(encoding = "UTF-8"). (In my case, setting the option in RStudio directly didn't affect the encodings!)
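
As a minimal sketch of that line in context (the variable names old_opts and current are just for illustration): options() returns the previous settings, so the change can be undone once the script is finished. This setting changes the default encoding assumed by file() connections and by source(); it is not a fix for every encoding problem.

```r
# Set the default connection encoding and remember the previous settings.
old_opts <- options(encoding = "UTF-8")

# Confirm the setting that file() and source() will now use by default.
current <- getOption("encoding")
print(current)

# ... the rest of the script goes here ...

# Optionally restore the previous default when finished.
options(old_opts)
```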

Upvotes: 2

Mace

Reputation: 1269

I found the answer myself. The problem was with the transformation from UTF-8 to the system locale (the default encoding in R) through fileEncoding. As I use RStudio, I just changed the default encoding to UTF-8 and removed fileEncoding="UTF-8-BOM" from the read.csv2() call. Then the entire CSV file was read and RStudio displayed all characters correctly.
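
For anyone not using RStudio, a related approach that may work is the encoding argument (rather than fileEncoding). As far as I understand it, encoding only marks the strings as UTF-8 instead of converting them to the system locale while reading. The sketch below is an illustration under assumptions: the temporary file and its contents are made-up stand-ins for Data.csv, and it has no BOM (if the real file starts with one, the first column name may pick up stray characters).

```r
# Write a small semicolon-separated sample file as raw UTF-8 bytes
# (hypothetical data standing in for the real Data.csv).
path <- tempfile(fileext = ".csv")
sample_lines <- c("id;name", "1;M\u00fcller", "2;\u00ffves")
writeLines(enc2utf8(sample_lines), path, useBytes = TRUE)

# encoding = "UTF-8" declares how the strings are encoded without re-encoding
# them, unlike fileEncoding, which converts the input while it is being read.
df <- read.csv2(path, encoding = "UTF-8", stringsAsFactors = FALSE)

print(df$name)
print(Encoding(df$name))
```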

Upvotes: 5
