Reputation: 187
I'm using R 2.15.0 on Windows 7 64-bit. I would like to output unicode (CJK) text to a file.
The following code shows that a Unicode character passed to write() on a UTF-8 file connection does not come out as I expected:
rty <- file("test.txt",encoding="UTF-8")
write("在", file=rty)
close(rty)
rty <- file("test.txt",encoding="UTF-8")
scan(rty,what=character())
close(rty)
As shown by the output of scan:
Read 1 item
[1] "<U+5728>"
The file was not written with the Unicode character itself, but with some kind of ANSI-compliant fallback (the escape string shown above). Can I make it work right the first time (i.e. get a text file that has "在" in it instead), or can I work some extra magic to convert the output to Unicode, with the proper character replacing the code string?
Thanks.
[More info: the same code behaves properly in Cygwin, R 2.14.2, while 2.14.2 on Win7 is also broken. Is this on my end somewhere?]
Upvotes: 13
Views: 18997
Reputation: 356
For anyone coming upon this question later, see the stringi package (https://cran.r-project.org/web/packages/stringi/index.html). It includes numerous functions to enable consistent, cross-platform UTF-8 string support in R. Most relevant to this thread, the stri_read_lines(), stri_read_raw(), and stri_write_lines() functions can consistently input/output UTF-8, even on Windows.
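A minimal sketch of such a round trip, assuming stringi is installed (the example is skipped otherwise). The temporary file path and the variable names are only illustrative, and the character is written as a \u escape so the script does not depend on the encoding of the source file:

```r
# Round-trip one CJK character through stringi's file helpers.
txt  <- enc2utf8("\u5728")            # the character from the question
path <- tempfile(fileext = ".txt")    # hypothetical scratch file

if (requireNamespace("stringi", quietly = TRUE)) {
  # Write the string as UTF-8, then read it back declaring the same encoding.
  stringi::stri_write_lines(txt, path, encoding = "UTF-8")
  back <- stringi::stri_read_lines(path, encoding = "UTF-8")
  print(identical(back, txt))
} else {
  message("stringi is not installed; skipping the example")
}
```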
Upvotes: 9
Reputation: 12860
The problem is due to some Windows-specific behaviour in R (it uses the default system encoding, or some system write functions; I do not know the specifics, but the behaviour is known).
To write text in UTF-8 encoding on Windows, one has to pass useBytes=TRUE to functions like writeLines, and encoding="UTF-8" to readLines when reading the file back:
txt <- "在"
writeLines(txt, "test.txt", useBytes=TRUE)
readLines("test.txt", encoding="UTF-8")
[1] "在"
A really well-written article by Kevin Ushey goes into much more detail: http://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/
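One way to check that the file really holds UTF-8 bytes, rather than an escape string, is to read it back in raw form. This is only a sketch: the temporary file and variable names are my own, and enc2utf8() is added as an assumption so the string is UTF-8 before its bytes are written verbatim.

```r
# Write with useBytes = TRUE, then inspect the bytes that landed on disk.
txt  <- enc2utf8("\u5728")            # the character from the question, as UTF-8
path <- tempfile(fileext = ".txt")    # hypothetical scratch file

writeLines(txt, path, useBytes = TRUE)

# Raw view of the file: three UTF-8 bytes followed by the line ending
# (the line ending differs between Windows and Unix-alikes).
bytes <- readBin(path, what = "raw", n = file.size(path))
print(bytes)

# Read the text back, telling R to mark it as UTF-8.
back <- readLines(path, encoding = "UTF-8")
print(identical(charToRaw(back), charToRaw(txt)))
```

The first three bytes should be e5 9c a8, the UTF-8 encoding of U+5728.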
Upvotes: 24
Reputation: 1008
This saves UTF-8 strings to a text file:
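kLogFileName <- "parser.log"
# Note: this definition masks base::log() in the calling environment.
log <- function(msg="") {
  # Open the log file in append mode.
  con <- file(kLogFileName, "a")
  tryCatch({
    # Convert the message to UTF-8 before writing it.
    cat(iconv(msg, to="UTF-8"), file=con, sep="\n")
  },
  finally = {
    close(con)
  })
}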
Upvotes: 8
Reputation: 1008
I have the same problem with UTF-8 strings that come from a database.
The only way I have found to save them properly is to write the file in binary mode.
F <- file(file.name, "wb")
tryCatch({
  writeBin(charToRaw(the_utf8_str), F)
},
finally = {
  close(F)
})
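A self-contained sketch of the same idea with a read-back step, in case it helps. The file path, the sample string and the variable names are assumptions for illustration; enc2utf8() stands in for a string that already arrives as UTF-8 from the database.

```r
# Write the UTF-8 bytes of a string in binary mode, then read them back.
utf8_str <- enc2utf8("\u5728")          # stand-in for a UTF-8 string from a DB
path     <- tempfile(fileext = ".txt")  # hypothetical scratch file

con <- file(path, "wb")
tryCatch(
  writeBin(charToRaw(utf8_str), con),   # bytes go out untouched, no newline added
  finally = close(con)
)

# Read the raw bytes and turn them back into a string marked as UTF-8.
raw_back <- readBin(path, what = "raw", n = file.size(path))
str_back <- rawToChar(raw_back)
Encoding(str_back) <- "UTF-8"
print(identical(charToRaw(str_back), charToRaw(utf8_str)))
```

Because the connection is binary, nothing is re-encoded and no line ending is appended, so add "\n" yourself if you need one.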
Upvotes: 0
Reputation: 263342
I think you are having problems because write is constructed so that it takes the name of an object, and you do not appear to have built such a named object. Try this instead:
txt <- "在"
rty <- file("test.txt",encoding="UTF-8")
write(txt, file=rty)
close(rty)
rty <- file("test.txt",encoding="UTF-8")
inp <- scan(rty,what=character())
#Read 1 item
close(rty)
inp
#[1] "在"
Upvotes: 0