Martin Schmelzer
Martin Schmelzer

Reputation: 23909

Writing a data.frame into a spreadsheet: Encoding problems

I have extraced some data from Instagram. All of the API request are done by

fromJSON(rawToChar(GET(url)$content))

When I try to display that dataframe within a Shiny application using renderTable() I get the error message

Error in sort.list(y) : 
  invalid input 'Ivy Manhattan í ¼í½‰' in 'utf8towcs'

Ok, it looks like an encoding problem. Ivy Manhattan í ¼í½‰ is the full name of a Instagram user. Intersting is, that the output is being rendered properly when I use rederDataTable(). Sure, why dont I just use it then?

The actual problem is, that I want to be able to write this dataframe to a spreadsheet using the package XLConnect. And when the program reaches the command writeWorksheet(wb, data) I get the same error.

Is it because those characters are unknown to R? How can I "filter" the dataframe for problematic strings?

First Edit as reaction to MrFlick's comment:

Well what I get when I fetch a request to the API is

GET(paste("https://api.instagram.com/v1/users/143/?client_id=f714dd*********58b18f"))$content

[1] 7b 22 6d 65 74 61 22 3a 7b 22 63 6f 64 65 22 3a 32 30 30 7d 2c 22 64 61 74 61 22  3a 7b 22 75 73 65 72 6e 61 6d 65 22 ... AND SO ON

rawToChar() converts that to the JSON string which is readable/accepted by fromJSON()

@MrFlick Second edit:

Here is the output of Sys.getlocale() from my machine:

> Sys.getlocale()
[1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252
LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"

GET declares the fetched string as encoded with utf-8. When I try to convert those with

iconv(x, from="utf-8", to="ASCII",sub="")

Nothing changes.

Upvotes: 0

Views: 1975

Answers (2)

MrFlick
MrFlick

Reputation: 206401

As discussed above, it would probably be better to use

content(GET(url), as="parsed", encoding="UTF-8")

this takes advantage of the httr package's ability to decode the content for you.

Note that when you see <U+2800> in output, that does not mean that those exact characters appear in the string. That's R's way of escaping unicode characters just like it adds extra slashed to escape other special characters like \r. You are seeing those characters because if your locale settings. You didn't mention what OS you are on. The Mac will use UTF-8 by default and should try to display those characters. I don't have access to a windows machine to test what the default is there. They seem to appear as "" when the locale "LC_ALL" is set to "C". This returns

Sys.getlocale()
# [1] "C/C/C/C/C/en_US.UTF-8"

x <- "\u2800\u2800\u2800Jenny";
print(x)
# [1] "<U+2800><U+2800><U+2800>Jenny"

so there aren't actually less-than/greater-than symbols or capital U's in the string. That's just how the C encoding will display them. If you want to remove non-ascii characters, you could do

iconv(x, from="UTF-8", to="ASCII", sub="")
# [1] "Jenny"

Excel may very well be able to handle other types of encoding but I personally don't know how that's managed with XLConnect

Upvotes: 0

Martin Schmelzer
Martin Schmelzer

Reputation: 23909

I tried to just encode the JSON part data$fullname since that seems to be the problem. I tried Encoding(data$fullname) = "UTF-8" at first which didnt resolve the situation. But then I switched to latin1and the spreadsheet happend to be written. Thanks for your pushy ideas! :)

Upvotes: 1

Related Questions