Reputation: 127
We have a database and R scripts running on different virtual machines. First we connect to db
con <- dbConnect(
odbc(),
Driver = "SQL Server",
Server = "server", Database = "db", UID = "uid", PWD = "pwd",
encoding = "UTF-8"
)
and gathering the data
data <- dbGetQuery(con, "SELECT * FROM TableName")
The problem is following: when different machines running the same script, for some of these we face character variables encoding issues.
For example, this is what we have on machine A
> data$char_var[1]
[1] "фамилия"
> Encoding(data$char_var[1])
[1] "UTF-8"
> Sys.getlocale()
[1] "LC_COLLATE=Russian_Russia.1251;LC_CTYPE=Russian_Russia.1251;LC_MONETARY=Russian_Russia.1251;LC_NUMERIC=C;LC_TIME=Russian_Russia.1251"
> Encoding(data$char_var[1]) <- "1251"
> data$char_var[1]
[1] "гревцев"
and this is what we have on machine B
> data$char_var[1]
[1] "<e3><f0><e5><e2><f6><e5><e2>"
> Encoding(data$char_var[1])
[1] "UTF-8"
> Sys.getlocale()
[1] "LC_COLLATE=Russian_Russia.1251;LC_CTYPE=Russian_Russia.1251;LC_MONETARY=Russian_Russia.1251;LC_NUMERIC=C;LC_TIME=Russian_Russia.1251"
> Encoding(data$char_var[1]) <- "1251"
> data$char_var[1]
[1] "фамилия"
First script returns gibberish, but it prints the initial value correctly. The same code running on machine B initially prints utf-8 and then returns encoded values. What could be the reason for such a difference?
As a result we want a script that would have the same "фамилия" output value to show it on a dashboard.
Upvotes: 1
Views: 1340
Reputation: 44808
According to the result of your call to Encoding(data$char_var[1])
, both machines are declaring the returned results to be encoded using UTF-8.
On the first machine, that appears to be true, because you're seeing valid output. Then you mess it up by declaring the encoding incorrectly as "1251"
, and you see gibberish.
On the second machine, the result comes to you declared as UTF-8, but it isn't (which is why it looks like gibberish to start). When you change the declared encoding to be "1251"
it looks okay, so that's what it must have been all along.
So you have two choices:
Make sure both machines are consistent about what they return from the dbGetQuery
. You can handle either encoding, but you need to know what it is and make sure it's declared correctly.
Alternatively, try to detect what is being returned, and declare it appropriately. One way to do this might be to put a known string in the database and compare the result to that. If you know you should get "фамилия"
and you get something else, switch the declared encoding. You could also try the readr::guess_encoding()
function.
One other issue is that some downstream function might only be able handle one or the other of UTF-8 and 1251 encodings. (Windows R is really bad at non-native encodings, and UTF-8 is never native on Windows.) In that case you may want to actually convert to a common encoding. You use the iconv()
function to do that, e.g.
iconv(char_var, from = "cp1251", to = "UTF-8")
will try to convert to UTF-8.
Upvotes: 2