Reputation: 370
Here is my code. I convert the query to UTF-8, but I still get an error saying the input is not proper UTF-8, and I can't work out how to fix it:
library("XML")
library("methods")
library("httr")
query = http://export.arxiv.org/api/query?search_query=(au:( \"Benoit Bertrand\"))&start=0&max_results=2000
xml_data = xmlToList(iconv(URLencode(query),to="UTF-8"))
Error: 1: Input is not proper UTF-8, indicate encoding !
Bytes: 0xC9 0x70 0x69 0x6A
I found that the space character is what makes the code crash, but that is all I have worked out so far.
Upvotes: 0
Views: 1056
Reputation: 131364
The question's code won't run as posted because of typos. Even if those were fixed, the code doesn't do anything useful: xmlToList is applied to the URL itself, not to the result of a GET request. That alone is enough to generate the error:
query<-"http://export.arxiv.org/api/query?search_query=(au:( \"Benoit Bertrand\"))&start=0&max_results=2000"
xmlToList(query)
No amount of URL encoding or conversion will fix that. No conversion is needed either, since every character in the URL falls in the US-ASCII range. In that range, a UTF-8 string is byte-for-byte identical to an ASCII string.
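As a quick check of that claim, here is a minimal sketch using base R only. The variable names is_ascii and unchanged are just illustrative; the URL is the one from the question.

```r
# The URL from the question, as a quoted R string.
query <- "http://export.arxiv.org/api/query?search_query=(au:( \"Benoit Bertrand\"))&start=0&max_results=2000"

# Every byte is below 128, so the string is pure US-ASCII.
is_ascii <- all(as.integer(charToRaw(query)) < 128L)

# Converting to UTF-8 therefore leaves the bytes untouched.
unchanged <- identical(charToRaw(iconv(query, to = "UTF-8")), charToRaw(query))
```

Both values should come out TRUE, which is why the iconv call in the question has no effect on the outcome.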
The correct code to get and parse this arXiv page is:
library(httr)
library(XML)
# Just a URL
query <- "http://export.arxiv.org/api/query?search_query=(au:( \"Benoit Bertrand\"))&start=0&max_results=2000"
# Get the contents
r <- GET(query)
# Extract the text from the response
xml <- content(r, "text")
# Read as lists
l <- xmlToList(xml)
The response r isn't just a string. It's an object that contains the headers (including the encoding), the response status and the response content. One of the headers is the Content-Type:
> r
Response [http://export.arxiv.org/api/query?search_query=(au:( "Benoit Bertrand"))&start=0&max_results=2000]
Date: 2019-09-30 12:54
Status: 200
Content-Type: application/atom+xml; charset=UTF-8
Size: 786 B
content(r, "text") converts the content to text using the encoding stored in that header.
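To illustrate why the declared encoding matters, here is a small offline sketch in base R. The bytes are made up for illustration and are not taken from the arXiv response; it only shows that the same character is stored differently in Latin-1 and UTF-8, so bytes have to be decoded with the encoding they were written in.

```r
# One character, "É", as stored in two different encodings.
latin1_bytes <- as.raw(0xC9)           # ISO-8859-1 (Latin-1)
utf8_bytes   <- as.raw(c(0xC3, 0x89))  # UTF-8

# Read as UTF-8, the lone Latin-1 byte is not a valid sequence.
latin1_is_valid_utf8 <- validUTF8(rawToChar(latin1_bytes))

# Converting with the correct source encoding yields the UTF-8 bytes.
converted  <- iconv(rawToChar(latin1_bytes), from = "latin1", to = "UTF-8")
same_bytes <- identical(charToRaw(converted), utf8_bytes)
```

Here latin1_is_valid_utf8 should be FALSE and same_bytes should be TRUE.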
Once the content has been decoded to text, xmlToList can parse the XML string.
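For an offline feel of what xmlToList returns, here is a minimal sketch. The XML below is a hand-written, hypothetical stand-in for the text of a response (it is not real arXiv output and omits the Atom namespace), and asText = TRUE is passed explicitly so the string is treated as XML content rather than as a file name or URL.

```r
library(XML)

# Hypothetical stand-in for the text returned by content(r, "text").
xml <- "<feed><title>demo</title><entry><id>1</id></entry></feed>"

# Parse the string as XML content, then convert the tree to nested lists.
doc <- xmlParse(xml, asText = TRUE)
l <- xmlToList(doc)

# Elements are then reachable by name.
feed_title <- l$title
entry_id   <- l$entry$id
```

With real responses the list is deeper and has one entry element per paper, but it is navigated the same way.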
Upvotes: 3