Jeff

Reputation: 78

character encoding, dplyr with database (postgresql)

I've read the threads and package updates on encoding issues with Shiny, but I have a database-driven Shiny app (hard to reduce to a reproducible example) that is mangling some special characters.

In my PostgreSQL database, my Swedish river is stored correctly as "Upper Umeälven River". When I filter it back into the Shiny interface with dplyr: names.rivers <- filter(tbl.rivers, Country == "Sweden") ...it becomes "Upper UmeÃ¤lven River" in R.

I'm using UTF-8 encoding locally; I guess I'm losing something on the exchange with the database.

Sys.getlocale()
[1] "LC_COLLATE=French_France.1252;LC_CTYPE=French_France.1252;LC_MONETARY=French_France.1252;LC_NUMERIC=C;LC_TIME=French_France.1252"

Apologies again for the lack of an example; it's ONLY an issue when pulling from the database. I suspect I'm missing a flag on some sanitizing function somewhere, but I need some help getting pointed in the right direction.

Upvotes: 2

Views: 1139

Answers (2)

Jeff

Reputation: 78

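As suspected, the answer was simple: iconv(vector.to.convert, "UTF-8"). The second positional argument of iconv() is from, so this converts the vector from UTF-8 to the native encoding.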

My "learnings":

  1. Encodings of the source file, the database, and data streams are not the same thing;
  2. I spent time making sure the data sources had been created in the correct encoding, while ignoring the (implicit?) conversion of the data stream;
  3. This page helped: http://shiny.rstudio.com/articles/unicode.html

My understanding is a bit shallow, but - frankly - I'm not digging deeper into the world of character encoding for the moment. I hope it helps someone else avoid the error!
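
For anyone landing here later, a minimal sketch of what that call does. It assumes the driver hands R the UTF-8 bytes without an encoding mark, which is what the symptoms suggest but is not confirmed above. The byte vector is a hypothetical stand-in for one value pulled from the database.

```r
# Hypothetical stand-in for a value returned by the database driver:
# the UTF-8 bytes of "Umeälven", with no encoding mark attached.
x <- rawToChar(as.raw(c(0x55, 0x6d, 0x65, 0xc3, 0xa4, 0x6c, 0x76, 0x65, 0x6e)))
Encoding(x)                      # "unknown": R will assume the native encoding

# Option 1: convert the bytes from UTF-8 to Latin-1. This is close to what
# iconv(x, "UTF-8") does on a CP1252 locale, where "to" defaults to native.
converted <- iconv(x, from = "UTF-8", to = "latin1")
charToRaw(converted)             # the two bytes c3 a4 collapse into one, e4

# Option 2: leave the bytes alone and only declare them as UTF-8.
marked <- x
Encoding(marked) <- "UTF-8"
nchar(marked, type = "chars")    # 8 characters, although there are 9 bytes
```

Option 2 avoids losing characters that the native code page cannot represent, but whether it is enough depends on how the rest of the app handles marked strings.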

Upvotes: 1

Pekka

Reputation: 3654

In your code page, 1252 (Windows Latin 1), the 'ä' in "Upper Umeälven River" is the single byte 0xE4 (binary 11100100).

The garbled "Upper UmeÃ¤lven River", in the same code page, has the two octets 0xC3 0xA4 (XXX00011 XX100100) in that position.

However, under the UTF-8 encoding rules, the significant bits of those two octets (00011 100100) are exactly the code point 0xE4. The X bits are only the UTF-8 marker bits, 110 and 10.
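
The bit arithmetic can be checked directly in R. This is only an illustrative sketch; the variable names are made up for the example.

```r
# Recover the code point from the two UTF-8 octets 0xC3 0xA4.
lead <- 0xC3L   # 110 00011: "110" marks the first byte of a two-byte sequence
cont <- 0xA4L   # 10 100100: "10" marks a continuation byte

# Mask off the marker bits and join the 5 + 6 payload bits.
code_point <- bitwOr(bitwShiftL(bitwAnd(lead, 0x1FL), 6L), bitwAnd(cont, 0x3FL))
code_point == 0xE4L                              # TRUE

# Cross-check with R's own UTF-8 decoder.
decoded <- utf8ToInt(rawToChar(as.raw(c(lead, cont))))
decoded == code_point                            # TRUE
```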

Somewhere, an inadvertent or erroneous conversion transcodes the character into UTF-8, while the string is still treated as being in the Windows Latin 1 code page.
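
That misreading can be reproduced outside the database. The sketch below assumes the platform's iconv knows the name "CP1252" (it does on Windows and on common Linux setups, but that is an assumption); the river name is shortened to keep the byte dump readable.

```r
# "\u00e4" is "ä"; enc2utf8() guarantees the bytes below are UTF-8.
utf8_bytes <- charToRaw(enc2utf8("Ume\u00e4lven"))    # ... c3 a4 ...

# Misread those same bytes as CP1252, then transcode to UTF-8 for display.
garbled <- iconv(rawToChar(utf8_bytes), from = "CP1252", to = "UTF-8")

# Each of the two octets is now shown as its own character:
# 0xC3 becomes U+00C3 and 0xA4 becomes U+00A4.
garbled == "Ume\u00c3\u00a4lven"
```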

Perhaps the data is already being received in UTF-8, and you can change the receiving code page to reflect that. There may be a silent transformation happening somewhere further back, with no indication of it.

Upvotes: 1
