tibo

Reputation: 5494

"Invalid byte 1 of 1-byte UTF-8 sequence" when reading an RSS feed

My code is super simple:

DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = db.parse("http://blog.rogermontgomery.com/feed/?cat=skaffold");

The problem is that I end up with an exception:

com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:684)
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:554)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1742)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.arrangeCapacity(XMLEntityScanner.java:1619)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipString(XMLEntityScanner.java:1657)
    at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:193)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:772)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:232)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:284)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:180)
    at com.skaffold.service.RogerBlogReader.read(RogerBlogReader.java:33)
[...]

I don't get it: the XML header declares the document as UTF-8, and the HTTP response is encoded in UTF-8... Any explanations?

Upvotes: 1

Views: 1441

Answers (1)

Ned Batchelder

Reputation: 375744

Not every sequence of bytes is valid UTF-8. A UTF-8 decoder can read a single byte and know from its value alone that it is illegal at that position, for example a continuation byte (0x80 to 0xBF) where a lead byte is expected. It sounds like you have a bad RSS feed, perhaps one that claims to be UTF-8 but is actually encoded differently, such as ISO-8859-1.
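To see the point about invalid bytes concretely, here is a minimal sketch that checks whether a byte array is well-formed UTF-8 using a strict `CharsetDecoder`. The class name `Utf8Check` and the method `isValidUtf8` are made up for illustration; the sample string is not taken from your feed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class Utf8Check {

    // Returns true only if the bytes decode as well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café", written with an escape to avoid source-encoding issues
        byte[] asUtf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] asLatin1 = text.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(isValidUtf8(asUtf8));   // true
        System.out.println(isValidUtf8(asLatin1)); // false: a lone 0xE9 is not valid UTF-8
    }
}
```

The same check also rejects the two-byte gzip signature (0x1F 0x8B), because 0x8B cannot start a UTF-8 sequence.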

Update: the feed URL is gzip-compressed. Have you tried decompressing it?
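If the response really is gzip-compressed, one way to handle it is to look at the first two bytes of the stream and wrap it in a `GZIPInputStream` before handing it to the parser. This is only a sketch: the class `GzipAwareXml` and its methods `maybeGunzip` and `parse` are hypothetical names, and the sample feed in `main` is invented so the example runs without a network connection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper, not part of any library.
final class GzipAwareXml {

    // Peeks at the first two bytes; gzip data always starts with 0x1F 0x8B.
    static InputStream maybeGunzip(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 2);
        int first = pb.read();
        int second = pb.read();
        // Push the bytes back in reverse order so they are read again in order.
        if (second != -1) pb.unread(second);
        if (first != -1) pb.unread(first);
        boolean gzipped = first == 0x1f && second == 0x8b;
        return gzipped ? new GZIPInputStream(pb) : pb;
    }

    // Parses XML from a stream that may or may not be gzip-compressed.
    static Document parse(InputStream in) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(maybeGunzip(in));
    }

    public static void main(String[] args) throws Exception {
        // Invented sample feed, compressed in memory to stand in for the HTTP body.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><rss version=\"2.0\"/>";
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        Document doc = parse(new ByteArrayInputStream(buf.toByteArray()));
        System.out.println(doc.getDocumentElement().getNodeName()); // rss
    }
}
```

With the real feed you would pass something like `new URL(feedUrl).openStream()` to `parse` instead of giving the URL string to `DocumentBuilder.parse`, which, judging by your stack trace, feeds the raw bytes straight to the UTF-8 reader.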

Upvotes: 2
