Bakaburg

Reputation: 3321

Escaped chars don't get correctly interpreted when reading from remote JSON using jsonlite::fromJSON() in R

I'm trying to read some JSON from an API page of a Twitter firehose. The tweets I download contain many non-English characters, e.g.:

"text":"Vaccini: perch\u00e9 fare l\u2019esavalente quando gli obbligatori sono solo quattro? http://t.co/dLdzoXOUUK via @wireditalia"

When I import the tweet data via readLines() in R and print it, I see:

\\"text\\":\\"Vaccini: perch\\u00e9 fare l\\u2019esavalente quando gli obbligatori sono solo quattro? http://t.co/dLdzoXOUUK via @wireditalia\\"

So both backslashes and quotes are escaped. If I print with cat() instead, the escaping is gone, so I thought the problem was just with print(). But when I parse the text with fromJSON(), strings like \u00e9 become \xe9. Trying to understand why, I noticed by some tests that

fromJSON('["\\u00e9"]') 

prints

"\xe9"

and

fromJSON('["\\u2019"]') 

prints

"\031"

instead of "é" and "’" respectively, as it should. So jsonlite::fromJSON misinterprets those double backslashes.
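To make the behaviour reproducible without the Twitter feed, here is a minimal example typed by hand (no real data involved) showing the display difference between print() and cat():

x <- "perch\\u00e9"   # double backslash: the string really contains \u00e9
print(x)
# [1] "perch\\u00e9"
cat(x)
# perch\u00e9

y <- "perch\u00e9"    # single backslash: R resolves the escape while parsing
print(y)              # "perché" (in a UTF-8 locale)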

But the problem is the double backslashes themselves! Why does R escape everything in the first place? I cannot even run gsub('\u', '\u', text, fixed=T), because it returns:

Error: '\u' used without hex digits in character string starting "'\u"

because it treats \u as a special character and doesn't allow it to be used as a replacement!

Moreover, this default escaping by R also makes my script fail when it encounters a user who set this as their location:

"location":"V\u03b1l\u03c1\u03bfl\u03b9c\u03b5ll\u03b1-V\u03b5r\u03bfn\u03b1 /=\\","default_profile_image":false

which on her Twitter profile is:

Vαlροlιcεllα-Vεrοnα /=\

That \" in the source code is displayed on R as /=\', therefore breaking the json.

So, I need a way to escape these escaping problems!

Upvotes: 1

Views: 691

Answers (1)

MrFlick

Reputation: 206401

The problem is in your input data. The text you read into R should not contain \u sequences as plain text; that is simply incorrect. When R displays a value with \u, that is an escape sequence for a Unicode character: there aren't actually any backslashes or "u"s in the text.
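For example, compare these two strings in plain base R (no jsonlite involved):

x <- "\u00e9"    # a real Unicode escape, resolved when R parses the literal
nchar(x)
# [1] 1
cat(x)
# é

y <- "\\u00e9"   # a literal backslash followed by "u00e9": six characters
nchar(y)
# [1] 6
cat(y)
# \u00e9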

But if you have bad data that you need to read into R, you can find all the \u sequences followed by four hexadecimal digits and replace them with proper Unicode characters. For example, say you have the string

tt<-"\\u00e9 and \\u2019 and \\u25a0"

If you cat() the value in R to remove the display escaping, you will see that it contains

cat(tt)
#\u00e9 and \u2019 and \u25a0

So there are literal "\u" sequences in the text (they are not true Unicode characters). We can find and replace them with

m <- gregexpr("\\\\u[0-9A-Fa-f]{4}", tt)        # locate every \u followed by 4 hex digits
regmatches(tt,m) <- lapply(
    lapply(regmatches(tt,m), substr, 3, 999),   # drop the leading "\u", keep the hex digits
    function(x)
    intToUtf8(as.integer(as.hexmode(x)), multiple=TRUE))  # hex -> code point -> character
tt
# [1] "é and ’ and ■"

This will find all the "\u" values and replace them.
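If you need to do this in more than one place, the same steps can be wrapped in a small helper; the name unescape_unicode is just something I made up, not a function from any package:

unescape_unicode <- function(x) {
    m <- gregexpr("\\\\u[0-9A-Fa-f]{4}", x)
    regmatches(x, m) <- lapply(
        lapply(regmatches(x, m), substr, 3, 999),
        function(h) intToUtf8(as.integer(as.hexmode(h)), multiple = TRUE))
    x
}

unescape_unicode("Vaccini: perch\\u00e9 fare l\\u2019esavalente")
# [1] "Vaccini: perché fare l’esavalente"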

It's just important to note that

fromJSON('["\\u2019"]')

is not a Unicode character. With the double backslash, you've escaped the escape character, so you literally have backslash-u. To get a true Unicode character you need

fromJSON('["\u2019"]') 

If your data were properly encoded before being loaded into R, this wouldn't be a problem. I don't know what you are using to download the tweets, but clearly it is mangling the escapes.
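If you can intervene where the tweets are downloaded, one hedged option (assuming the feed is newline-delimited JSON saved to a file; the file name below is made up) is to read it as UTF-8 and let jsonlite decode the \uXXXX escapes itself:

library(jsonlite)

# one JSON object per line: stream_in() parses them and decodes \uXXXX escapes
tweets <- stream_in(file("tweets.json", encoding = "UTF-8"))

# or parse a single line at a time
lines  <- readLines("tweets.json", encoding = "UTF-8")
tweet1 <- fromJSON(lines[1])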

Upvotes: 1
