OmaymaS
OmaymaS

Reputation: 1721

Force UTF-8 Encoding for imported multi-language table using Rmysql from SQL database

I am reading a table from a SQL database, with a column of cities in (English, Arabic and possibly other languages). The encoding type for all is Unknown, when I try to force them using Encoding, some change and some remain Unknown.

I tried using dbGetQuery but it is the same :

con <- dbConnect(RMySQL::MySQL(), host = "***",dbname="***",user = "***", password = "***")

dbGetQuery(con,"set names utf8")

Q1 <- dbSendQuery(con, "SELECT * FROM cities")

city <- fetch(Q1, n = -1)

> Encoding(city$name) %>% table()
.
unknown 
  45734 

When I force the change, some get converted, but entries with Arabic characters for example do not get converted.

> Encoding(city$name) <- "UTF-8"
> Encoding(city$name) %>% table()
.
unknown   UTF-8 
  44920     814 

And here is the result of SHOW VARIABLES LIKE 'character_set_%

dbSendQuery(con, "SET NAMES UTF8; ")
<MySQLResult:8,3,3>
> dbGetQuery(con, "SHOW VARIABLES LIKE 'character_set_%'")
             Variable_name                      Value
1     character_set_client                       utf8
2 character_set_connection                       utf8
3   character_set_database                     latin1
4 character_set_filesystem                     binary
5    character_set_results                       utf8
6     character_set_server                     latin1
7     character_set_system                       utf8
8       character_sets_dir /usr/share/mysql/charsets/

And here's the session info

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringi_1.1.2 tidyr_0.5.1   purrr_0.2.2   dplyr_0.5.0   RMySQL_0.10.9 DBI_0.4-1    

loaded via a namespace (and not attached):
[1] magrittr_1.5   R6_2.1.2       assertthat_0.1 tools_3.3.1    tibble_1.1     Rcpp_0.12.6

And here's the encoding in the db: table info from php my admin

Upvotes: 2

Views: 759

Answers (1)

Scarabee
Scarabee

Reputation: 5704

As said in the docs:

ASCII strings will never be marked with a declared encoding, since their representation is the same in all supported encodings.

Illustration:

v <- c("café", "floor", "window", "naïve")
Encoding(v) <- "UTF-8"

Encoding(v)
# [1] "UTF-8"   "unknown" "unknown" "UTF-8"

Some of your city names are in English, so they probably don't contain non-ASCII characters.

Upvotes: 1

Related Questions