Jérémy

Reputation: 370

R Error: 1: Input is not proper UTF-8, indicate encoding ! XMLtoList

Here is my code. I have a query that I convert to UTF-8, but I still get an error saying that the input is not proper UTF-8, and I can't manage to fix it:

library("XML")
library("methods")
library("httr")

query = http://export.arxiv.org/api/query?search_query=(au:( \"Benoit Bertrand\"))&start=0&max_results=2000
xml_data = xmlToList(iconv(URLencode(query),to="UTF-8"))

Error: 1: Input is not proper UTF-8, indicate encoding !
Bytes: 0xC9 0x70 0x69 0x6A

I found that the space character is what makes the code crash, but that is all I have figured out.

Upvotes: 0

Views: 1056

Answers (1)

Panagiotis Kanavos

Reputation: 131364

The question's code won't run as posted because of syntax errors (the URL is not quoted, for example). Even with those errors fixed, the code doesn't do anything useful: xmlToList is applied to the URL, not to the result of a GET request. That alone is enough to generate the error:

query<-"http://export.arxiv.org/api/query?search_query=(au:( \"Benoit Bertrand\"))&start=0&max_results=2000"
xmlToList(query)

No amount of URL encoding or conversion will fix that. No conversion is needed either, since every character in the URL falls in the US-ASCII range, and in that range a UTF-8 string is byte-for-byte identical to an ASCII string.
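A minimal sketch of that last point, using only base R. It compares the raw bytes of the URL before and after an explicit conversion; the variable names `converted` and `same_bytes` are just illustrative.

```r
# The URL from the question: every character is plain ASCII
query <- "http://export.arxiv.org/api/query?search_query=(au:( \"Benoit Bertrand\"))&start=0&max_results=2000"

# Explicitly convert from ASCII to UTF-8
converted <- iconv(query, from = "ASCII", to = "UTF-8")

# Compare the underlying bytes of both strings
same_bytes <- identical(charToRaw(query), charToRaw(converted))
print(same_bytes)
```

If the bytes are the same, the iconv call in the question cannot have changed what xmlToList receives.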

The correct code to get and parse this arXiv page is:

library(httr)
library(XML)

# Just a URL
query <- "http://export.arxiv.org/api/query?search_query=(au:( \"Benoit Bertrand\"))&start=0&max_results=2000"
# Get the contents
r <- GET(query)
# Extract the text from the response
xml <- content(r, "text")
# Read as lists
l <- xmlToList(xml)

The response r isn't just a string. It's an object that contains the headers (including the encoding), the response status and the response content. One of the headers is Content-Type:

> r
Response [http://export.arxiv.org/api/query?search_query=(au:( "Benoit Bertrand"))&start=0&max_results=2000]
  Date: 2019-09-30 12:54
  Status: 200
  Content-Type: application/atom+xml; charset=UTF-8
  Size: 786 B

content(r, "text") converts the content to text using the encoding stored in that header.
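To illustrate why the declared encoding matters, here is a small offline sketch in base R that uses the four bytes quoted in the error message. Treating them as Latin-1 is an assumption made for the sake of the example; the question does not say where those bytes came from.

```r
# The bytes reported in the question's error message
bytes <- as.raw(c(0xC9, 0x70, 0x69, 0x6A))

# Reinterpret the bytes as a string without declaring any encoding
raw_text <- rawToChar(bytes)

# 0xC9 followed by 0x70 is not a valid UTF-8 sequence
print(validUTF8(raw_text))

# Assumption: the bytes are Latin-1. Decoding them as such yields valid UTF-8.
fixed <- iconv(raw_text, from = "latin1", to = "UTF-8")
print(validUTF8(fixed))
print(fixed)
```

The same bytes are either rejected or readable depending on the encoding used to decode them, which is why the text should be decoded with the charset the server declares rather than a guessed one.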

After that, xmlToList can parse the XML string.
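The parsing step can be tried without a network connection. The sketch below uses a made-up, heavily simplified feed in place of the real arXiv response; `xml_text` and its contents are hypothetical and only stand in for the string returned by content(r, "text").

```r
library(XML)

# Hypothetical stand-in for the text returned by content(r, "text")
xml_text <- paste0(
  "<feed>",
  "<title>Example feed</title>",
  "<entry><id>1</id><title>First entry</title></entry>",
  "</feed>"
)

# asText = TRUE states that the input is XML content, not a file name or URL
doc <- xmlParse(xml_text, asText = TRUE)

# Convert the parsed document into nested R lists
feed <- xmlToList(doc)

print(feed$title)
print(feed$entry$title)
```

Passing asText = TRUE makes the intent explicit: the string is the document itself, which is exactly the distinction the question's code missed when it handed xmlToList a URL.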

Upvotes: 3
