Reputation: 3321
I'm trying to read some JSON from an API page of the Twitter firehose. The tweets I download contain many non-English characters, e.g.:
"text":"Vaccini: perch\u00e9 fare l\u2019esavalente quando gli obbligatori sono solo quattro? http://t.co/dLdzoXOUUK via @wireditalia"
When I import the tweets data via readLines in R and print it I see:
\\"text\\":\\"Vaccini: perch\\u00e9 fare l\\u2019esavalente quando gli obbligatori sono solo quattro? http://t.co/dLdzoXOUUK via @wireditalia\\"
So both the backslashes and the quotes are escaped. If I print it with cat() instead, the escaping is gone, so at first I thought the problem was with print(). But when I parse the text with fromJSON, strings like \u00e9 become \xe9. Trying to understand why, I noticed that
fromJSON('["\\u00e9"]')
prints
"\xe9"
and
fromJSON('["\\u2019"]')
prints
"\031"
instead of "é" and "'" respectively, as it should. So jsonlite::fromJSON misinterprets those double backslashes.
But the problem is the double backslashes themselves! Why does R escape everything in the first place? I cannot even run gsub('\u', '\u', text, fixed=T); it returns:
Error: '\u' used without hex digits in character string starting "'\u"
because R treats \u as the start of an escape sequence and won't allow it in a string on its own!
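(For reference, a literal backslash has to be doubled in R source code, so the pattern can be written that way; a minimal sketch, with a made-up replacement just to make the substitution visible:)

```r
x <- "perch\\u00e9"    # the string actually contains backslash, u, 0, 0, e, 9
cat(x)                 # perch\u00e9
# doubling the backslash in the R source lets gsub match the literal "\u":
gsub("\\u", "[u]", x, fixed = TRUE)  # "perch[u]00e9"
```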
Moreover, this default escaping by R also makes my script fail when it encounters a user who set this as their location:
"location":"V\u03b1l\u03c1\u03bfl\u03b9c\u03b5ll\u03b1-V\u03b5r\u03bfn\u03b1 /=\\","default_profile_image":false
which on her Twitter profile is:
Vαlροlιcεllα-Vεrοnα /=\
That \" in the source code is displayed on R as /=\', therefore breaking the json.
So, I need a way to escape these escaping problems!
Upvotes: 1
Views: 691
Reputation: 206401
The problem is in your input data. The text you read into R should not contain \u
values as plain text; that is simply malformed. When R displays a value with \u,
that is an escape sequence for a Unicode character: there aren't actually any backslashes or "u"s in the text.
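A quick check in base R makes the distinction concrete:

```r
nchar("\u00e9")     # 1: a single character, the escape denotes é itself
nchar("\\u00e9")    # 6: backslash, u, 0, 0, e, 9 as literal text
cat("\u00e9", "\n")   # prints: é
cat("\\u00e9", "\n")  # prints: \u00e9
```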
But if you have bad data that you need to read into R, you can find all the \u
values followed by hexadecimal digits and replace them with proper Unicode characters. For example, say you have this string in tt:
tt <- "\\u00e9 and \\u2019 and \\u25a0"
If you cat()
the value in R to remove the escaping, you will see that it contains:
cat(tt)
#\u00e9 and \u2019 and \u25a0
So there are "\u" values in the text (they are not true unicode characters). We can find and replace them with
# locate every literal "\u" followed by four hex digits
m <- gregexpr("\\\\u[0-9A-Fa-f]{4}", tt)
# strip the leading "\u" from each match, convert the hex digits
# to an integer code point, and substitute the real character
regmatches(tt,m) <- lapply(
lapply(regmatches(tt,m), substr,3, 999), function(x)
intToUtf8(as.integer(as.hexmode(x)), multiple=TRUE))
tt
# [1] "é and ’ and ■"
This will find all the "\u" values and replace them.
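The same logic can be wrapped in a small helper for reuse (the function name is just for illustration):

```r
unescape_unicode <- function(s) {
  # locate every literal "\u" followed by four hex digits
  m <- gregexpr("\\\\u[0-9A-Fa-f]{4}", s)
  # drop the "\u" prefix, convert the hex digits to code points,
  # and replace each match with the character it encodes
  regmatches(s, m) <- lapply(
    lapply(regmatches(s, m), substr, 3, 999),
    function(x) intToUtf8(as.integer(as.hexmode(x)), multiple = TRUE))
  s
}

unescape_unicode("perch\\u00e9")  # "perché"
```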
It's just important to note that
fromJSON('["\\u2019"]')
is not a Unicode character. By doubling the backslash, you've escaped the escape character, so the string literally contains backslash-u. To get a true Unicode character you need
fromJSON('["\u2019"]')
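You can see the difference between the two inputs without jsonlite at all; the single-backslash form already contains the real character before the JSON parser ever sees it:

```r
s1 <- '["\\u2019"]'  # JSON text with a literal \u2019 escape in it
s2 <- '["\u2019"]'   # R string already containing the ’ character itself
nchar(s1)  # 10
nchar(s2)  # 5
```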
If your data were properly encoded before being loaded into R, this wouldn't be a problem. I don't understand what you are using to download the tweets, but clearly it is messing things up.
Upvotes: 1