Adam Hyland
Adam Hyland

Reputation: 1057

How to correctly deal with escaped Unicode Characters in R e.g. the em dash (—)

I'm having trouble handling escaped unicode characters in R, specifically those encountered when grabbing information from the MediaWiki API. I would find a JSON string like

{"query":{"categorymembers":[{"ns":0,"title":"Banach\u2013Tarski paradox"}]}}

Which should be perfectly valid but when read in through fromJSON() I get:

snip...
[1] "Banach\023Tarski paradox"

Initially I thought this was just a problem with RJSONIO, but I encounter similar problems with scan() and readLines(). My guess is that I am missing something very basic.

I can't actually give a completely reproducible example using only R because if I send "em\u2013dash" to a file through write() (or some equivalent function) R will automatically convert the em dash. So here goes. Create a text file named test1 with the following:

"em\u2013dash" "em–dash" " em \u2013 dash"

Then load up R (for whatever the file path is):

> scan( file = "~/R/test1", what = "character", encoding = "UTF-8")
Read 3 items
[1] "em\\u2013dash"    "em–dash"          " em \\u2013 dash"
> readLines("~/R/test1", warn = FALSE, encoding = "UTF-8")
[1] "\"em\\u2013dash\" \"em–dash\" \" em \\u2013 dash\""

The added escape character is what causes my problems with fromJSON(). I could just strip them out but I'd probably break something else in the process and I imagine there is an easier solution. Thanks.

Here's the session info:

R version 2.14.1 (2011-12-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] C/en_US.UTF-8/C/C/C/C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RJSONIO_0.98-0

loaded via a namespace (and not attached):
[1] tools_2.14.1

Upvotes: 10

Views: 7733

Answers (3)

Jeroen Ooms
Jeroen Ooms

Reputation: 32978

I think the underlying problem is that the libjson option JSON_UNICODE is not enabled in RJSONIO. However it seems like the problem does not manifest itself when the input is UTF-8 encoded:

library(RJSONIO)
x = "北京填鴨们"
identical(x, fromJSON(toJSON(x)))
# [1] TRUE

The problem only appears when the input uses JSON escaped characters. In these cases, RJSONIO seems to generate latin1 output, but doesn't mark set the encoding correctly:

x <- fromJSON('["Z\\u00FCrich"]')
print(x)
# [1] "Z\xfcrich"

nchar(x)
#Error in nchar(x) : invalid multibyte string 1

For this simple example we can fix it by manually setting the encoding to latin1:

#Set the correct encoding
Encoding(x) <- "latin1"
print(x)
#[1] "Zürich" 

However, this of course won't work for characters outside the latin1 set:

#This should be: "填"
fromJSON('["\\u586B"]')

Upvotes: 1

Duncan Temple Lang
Duncan Temple Lang

Reputation: 141

This is not in fact a bug in RJSONIO. It is designed to expect a string that has been read by R and which has the non-ASCII characters already processed. When one passes it a string with \u, that has not been processed but escaped. On my machine with a locale set to en_US.UTF-8, the command

fromJSON('{"query":{"categorymembers":[{"ns":0,"title":"Banach\u2013Tarski paradox"}]}}')

produces

$query
$query$categorymembers
$query$categorymembers[[1]]
$query$categorymembers[[1]]$ns
[1] 0

$query$categorymembers[[1]]$title
[1] "Banach–Tarski paradox"

Note that the character is prefixed by \u not \\u. See how it appears in R when you simply enter that string.

So the issue is upstream of fromJSON() as to why the string contains \u.
I may add support in RJSONIO for handling such unprocessed strings.

Upvotes: 9

Simon Urbanek
Simon Urbanek

Reputation: 13932

It is a bug in RJSONIO as you can clearly see:

> RJSONIO::fromJSON('{"x":"foo\\u2013bar"}')
           x 
"foo\023bar" 

It works just fine in rjson:

> rjson::fromJSON('{"x":"foo\\u2013bar"}')
$x
[1] "foo–bar"

and to prove it is the correct value:

 > Sys.setlocale("LC_ALL", "C")
[1] "C/C/C/C/C/en_US.UTF-8"
> rjson::fromJSON('{"x":"foo\\u2013bar"}')
$x
[1] "foo<U+2013>bar"

In your analysis you got confused by printed string vs actual strings. print quotes its content for printing - if you want to see the actual string, you can use cat or charToRaw. Also scan doesn't interpret any escapes, so you get what you give it.

Upvotes: 5

Related Questions