Reputation: 1057
I'm having trouble handling escaped Unicode characters in R, specifically those encountered when grabbing information from the MediaWiki API. I get back a JSON string like
{"query":{"categorymembers":[{"ns":0,"title":"Banach\u2013Tarski paradox"}]}}
which should be perfectly valid, but when I read it in through fromJSON() I get:
snip...
[1] "Banach\023Tarski paradox"
Initially I thought this was just a problem with RJSONIO, but I encounter similar problems with scan() and readLines(). My guess is that I am missing something very basic.
I can't actually give a completely reproducible example using only R because if I send "em\u2013dash" to a file through write() (or some equivalent function) R will automatically convert the em dash. So here goes. Create a text file named test1 with the following:
"em\u2013dash" "em–dash" " em \u2013 dash"
Then load up R (using whatever your file path is):
> scan( file = "~/R/test1", what = "character", encoding = "UTF-8")
Read 3 items
[1] "em\\u2013dash" "em–dash" " em \\u2013 dash"
> readLines("~/R/test1", warn = FALSE, encoding = "UTF-8")
[1] "\"em\\u2013dash\" \"em–dash\" \" em \\u2013 dash\""
The added escape characters are what cause my problems with fromJSON(). I could just strip them out, but I'd probably break something else in the process, and I imagine there is an easier solution. Thanks.
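For what it's worth, the kind of decoding I have in mind looks something like this (just a sketch; it leans on the R parser to interpret the escapes, and would presumably break on strings containing stray quotes or backslashes):

x <- scan(file = "~/R/test1", what = "character", encoding = "UTF-8")
# each item still contains a literal backslash + u2013; re-quote it and let the
# R parser decode the \uXXXX escape into the actual character
decoded <- vapply(x, function(s) eval(parse(text = paste0('"', s, '"'))),
                  character(1), USE.NAMES = FALSE)
decoded[1]
# should print "em–dash"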
Here's the session info:
R version 2.14.1 (2011-12-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] C/en_US.UTF-8/C/C/C/C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RJSONIO_0.98-0
loaded via a namespace (and not attached):
[1] tools_2.14.1
Upvotes: 10
Views: 7733
Reputation: 32978
I think the underlying problem is that the libjson option JSON_UNICODE is not enabled in RJSONIO. However, the problem does not seem to manifest itself when the input is UTF-8 encoded:
library(RJSONIO)
x = "北京填鴨们"
identical(x, fromJSON(toJSON(x)))
# [1] TRUE
The problem only appears when the input uses JSON-escaped characters. In these cases, RJSONIO seems to generate latin1 output, but doesn't mark the encoding correctly:
x <- fromJSON('["Z\\u00FCrich"]')
print(x)
# [1] "Z\xfcrich"
nchar(x)
#Error in nchar(x) : invalid multibyte string 1
For this simple example we can fix it by manually setting the encoding to latin1:
#Set the correct encoding
Encoding(x) <- "latin1"
print(x)
#[1] "Zürich"
However, this of course won't work for characters outside the latin1 set:
#This should be: "填"
fromJSON('["\\u586B"]')
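One possible way around that limitation is to decode the \uXXXX escapes yourself before calling fromJSON, so that it only ever sees UTF-8 input. Something along these lines might work (a rough sketch; unescape_unicode is just an ad-hoc helper, it ignores surrogate pairs and assumes every \u in the text really is an escape):

unescape_unicode <- function(s) {
  # replace each literal \uXXXX sequence with the corresponding UTF-8 character
  m <- gregexpr("\\\\u[0-9a-fA-F]{4}", s)
  regmatches(s, m) <- lapply(regmatches(s, m), function(esc)
    vapply(esc, function(e) intToUtf8(strtoi(substring(e, 3), 16L)),
           character(1)))
  s
}
fromJSON(unescape_unicode('["\\u586B"]'))
# should now give "填", since the UTF-8 code path works as in the first example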
Upvotes: 1
Reputation: 141
This is not in fact a bug in RJSONIO. It is designed to expect a string that has been read by R and which has the non-ASCII characters already processed. When one passes it a string in which the \u sequence is still present literally, that string has not been processed, only escaped. On my machine, with the locale set to en_US.UTF-8, the command
fromJSON('{"query":{"categorymembers":[{"ns":0,"title":"Banach\u2013Tarski paradox"}]}}')
produces
$query
$query$categorymembers
$query$categorymembers[[1]]
$query$categorymembers[[1]]$ns
[1] 0
$query$categorymembers[[1]]$title
[1] "Banach–Tarski paradox"
Note that the character is prefixed by \u, not \\u.
See how it appears in R when you simply enter that string.
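For example, with an abbreviated version of the string in a UTF-8 locale, you should see something like:

> '{"title":"Banach\u2013Tarski paradox"}'
[1] "{\"title\":\"Banach–Tarski paradox\"}"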
So the issue of why the string contains \u in the first place lies upstream of fromJSON().
I may add support in RJSONIO for handling such unprocessed strings.
Upvotes: 9
Reputation: 13932
It is a bug in RJSONIO, as you can clearly see:
> RJSONIO::fromJSON('{"x":"foo\\u2013bar"}')
x
"foo\023bar"
It works just fine in rjson:
> rjson::fromJSON('{"x":"foo\\u2013bar"}')
$x
[1] "foo–bar"
and to prove it is the correct value:
> Sys.setlocale("LC_ALL", "C")
[1] "C/C/C/C/C/en_US.UTF-8"
> rjson::fromJSON('{"x":"foo\\u2013bar"}')
$x
[1] "foo<U+2013>bar"
In your analysis you got confused by printed strings vs. actual strings. print quotes its content for printing; if you want to see the actual string, you can use cat or charToRaw. Also, scan doesn't interpret any escapes, so you get back exactly what you give it.
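Roughly, in a UTF-8 locale, you'd see something like:

> s <- "foo\u2013bar"
> print(s)
[1] "foo–bar"
> cat(s, "\n")
foo–bar
> charToRaw(s)
[1] 66 6f 6f e2 80 93 62 61 72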
Upvotes: 5