John Smith

Reputation: 1698

writeLines behavior with special characters

When I run the following line in R (RStudio):

writeLines("hello \U1F30D",useBytes = T)

I get different results.

With a PC, I get

hello ðŸŒ

or

writeLines("hello \U1F30D",useBytes = F)
hello <U+0001F30D>

And with a Mac, I get

writeLines("hello \U1F30D",useBytes = F)
hello 🌍

I don't think the behavior is due to the machine itself; it should come down to the encoding. But I checked the encoding setting in RStudio, and it is UTF-8 on both machines. So now I have no idea why the behavior differs. Could anyone explain the differences?

Upvotes: 1

Views: 477

Answers (1)

Kevin Ushey

Reputation: 21285

I wrote a somewhat long-form answer to this question here: https://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/

The short answer: writeLines("<text>", useBytes = FALSE) will attempt to re-encode the provided text to the native encoding. This works on Unix systems using a UTF-8 locale (which is the default nowadays) but fails when the native encoding is not UTF-8 (e.g., on Windows), where characters that cannot be represented come out as escapes such as <U+0001F30D>. In effect, you need something like:

writeLines("<text>", file, useBytes = TRUE)
readLines(file, encoding = "UTF-8")
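
As a rough, self-contained sketch of that round trip (the names text, path and back are placeholders I made up, and a temporary file stands in for your real one), something like the following should work regardless of the native encoding:

```r
# Sketch only: write UTF-8 bytes unchanged, then read them back marked as UTF-8.
text <- enc2utf8("hello \U1F30D")     # make sure the string is stored as UTF-8
path <- tempfile(fileext = ".txt")    # hypothetical temporary file

# useBytes = TRUE writes the bytes as-is instead of re-encoding to the native encoding
writeLines(text, path, useBytes = TRUE)

# encoding = "UTF-8" marks the strings read as UTF-8; it does not convert them
back <- readLines(path, encoding = "UTF-8")

# Compare the raw bytes to confirm nothing was lost along the way
identical(charToRaw(back), charToRaw(text))
```

Comparing raw bytes rather than printing avoids being misled by how the console itself renders the string.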

Note that diagnosing encoding issues on Windows can be challenging, as R will fairly aggressively re-encode UTF-8 text into the native encoding (sometimes attempting to round-trip UTF-8 -> native -> UTF-8), and that conversion is usually lossy.
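
If it helps, here is a minimal sketch of the checks I would start with when comparing two machines (the variable names are only illustrative). It looks at the session's native encoding, which is separate from the RStudio setting, and at the bytes of the string itself rather than at how it prints:

```r
# Sketch only: inspect the native encoding and the string's actual bytes.
x <- "hello \U1F30D"

ctype   <- Sys.getlocale("LC_CTYPE")   # locale that determines the native encoding
is_utf8 <- l10n_info()[["UTF-8"]]      # TRUE when the native encoding is UTF-8
mark    <- Encoding(x)                 # encoding mark R has attached to the string
bytes   <- charToRaw(x)                # the bytes as stored, independent of printing

print(ctype)
print(is_utf8)
print(mark)
print(bytes)
```

If is_utf8 differs between the two machines, that would be consistent with the differences in output shown in the question.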

Upvotes: 2
