Farid Abbasov

Reputation: 87

java.io.IOException: Mark has been invalidated when parsing website with jsoup

When I try to parse an HTML page of a website, it crashes with the error:

java.io.IOException: Mark has been invalidated.

Part of my code:

String xml = xxxxxx;
try {
    Document document = Jsoup.connect(xml).maxBodySize(1024*1024*10)
            .timeout(0).ignoreContentType(true)
            .parser(Parser.xmlParser()).get();

    Elements elements = document.body().select("td.hotv_text:eq(0)");

    for (Element element : elements) {
        Element element1 = element.select("a[href].hotv_text").first();
        hashMap.put(element.text(), element1.attr("abs:href"));
    }
} catch (HttpStatusException ex) {
    Log.i("GyWueInetSvc", "Exception while JSoup connect:" + xml +" cause:"+ ex.getMessage());
} catch (IOException e) {
    e.printStackTrace();
    throw new RuntimeException("Socket timeout: " + e.getMessage(), e);
}

The page I want to parse is about 2 MB. When I debug, I see execution reach this method in jsoup's ConstrainableInputStream.java:

public void reset() throws IOException {
    super.reset();
    remaining = maxSize - markpos;
}

At that point markpos is -1, and the exception is thrown.

How can I solve that problem?

Upvotes: 1

Views: 3602

Answers (5)

Angel Koh

Reputation: 13545

To add to @ulong's answer regarding the use of bufferUp():

This is recommended in the documentation inside the jsoup source itself if you need to parse the document several times. bufferUp() is called before parse() so that the InputStream is not drained, which would otherwise result in an invalid mark error (IOException).

    /**
     * Read and parse the body of the response as a Document. If you intend to parse the same response multiple
     * times, you should {@link #bufferUp()} first.
     * @return a parsed Document
     * @throws IOException on error
     */
    Document parse() throws IOException;

And regarding bufferUp():

    /**
     * Read the body of the response into a local buffer, so that {@link #parse()} may be called repeatedly on the
     * same connection response (otherwise, once the response is read, its InputStream will have been drained and
     * may not be re-read). Calling {@link #body() } or {@link #bodyAsBytes()} has the same effect.
     * @return this response, for chaining
     * @throws UncheckedIOException if an IO exception occurs during buffering.
     */
    Response bufferUp();
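
For what it's worth, the "drained" behaviour in that comment can be reproduced with plain java.io, without jsoup. The sketch below is only an analogy, not jsoup's implementation: the class BufferUpAnalogy and its drain() helper are names made up for illustration. It shows that a stream yields nothing on a second read, while a byte array copied from it can be wrapped in a fresh stream as often as needed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; an analogy for bufferUp(), not jsoup code.
class BufferUpAnalogy {
    /** Reads the stream to its end and returns everything that was left in it. */
    static byte[] drain(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for a response body.
        InputStream body = new ByteArrayInputStream(
                "<html>hello</html>".getBytes(StandardCharsets.UTF_8));

        byte[] buffered = drain(body);   // first read gets the whole body
        byte[] again = drain(body);      // second read finds the stream already empty
        System.out.println("first read: " + buffered.length + " bytes");
        System.out.println("second read: " + again.length + " bytes");

        // The local copy can be re-read any number of times via a fresh stream.
        for (int i = 0; i < 2; i++) {
            byte[] copy = drain(new ByteArrayInputStream(buffered));
            System.out.println("re-read from buffer: " + copy.length + " bytes");
        }
    }
}
```

Keeping a local copy costs memory proportional to the body size, which is presumably why buffering is opt-in rather than automatic.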

Upvotes: 0

nDijax

Reputation: 521

I got the same exception after upgrading from 1.11.3 to 1.12.2. Try downgrading your dependencies.

Upvotes: 1

ulong Mask

Reputation: 51

This helped me:

GET: .execute().bufferUp().parse();
POST: .method(Connection.Method.POST).execute().bufferUp().parse();

Upvotes: 5

Ovokerie Ogbeta

Reputation: 513

Use ~.execute().parse(); instead of ~.get(); to get the document, and remove the parser, so your code becomes:

Document document = Jsoup.connect(xml).maxBodySize(1024*1024*10)
            .timeout(0).ignoreContentType(true)
            .execute().parse();  

This is a temporary fix while we wait for a new version that fixes the bug.

Upvotes: -1

Farid Abbasov

Reputation: 87

I found a solution to the problem. The problem was buffer overloading. I solved it with the code below:

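import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Scanner;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String content = "";
try {
    URLConnection connection = new URL(xml).openConnection();
    StringBuilder sb = new StringBuilder();
    // Download the page ourselves, line by line, instead of letting jsoup fetch it.
    // "UTF-8" is an assumption about the page encoding; adjust it if the site differs.
    try (Scanner scanner = new Scanner(connection.getInputStream(), "UTF-8")) {
        while (scanner.hasNextLine()) {
            sb.append(scanner.nextLine()).append('\n');
        }
    }
    content = sb.toString();
} catch (MalformedURLException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
// Pass the URL as base URI so that attr("abs:href") can still resolve relative links.
Document document = Jsoup.parse(content, xml);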
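
For anyone wondering what markpos = -1 means: markpos is a protected field of java.io.BufferedInputStream, so the behaviour can be shown with the JDK class alone. The following is a minimal sketch in plain java.io, not jsoup's code; the class MarkResetDemo, the method resetSucceeds and the tiny sizes (8-byte buffer, read limit of 4) are made up for illustration. Once more bytes are read past the mark than the buffer and read limit allow, the stream drops the mark and reset() throws an IOException, whose message text varies by runtime.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;

// Hypothetical demo class; illustrates mark/reset in java.io, not jsoup internals.
class MarkResetDemo {
    /**
     * Marks the start of a buffered stream, reads bytesToRead bytes one at a time,
     * then tries to reset() back to the mark. Returns false if reset() fails.
     */
    static boolean resetSucceeds(int bufferSize, int readLimit, int bytesToRead) {
        byte[] data = new byte[64]; // in-memory stand-in for a response body
        try (BufferedInputStream in =
                 new BufferedInputStream(new ByteArrayInputStream(data), bufferSize)) {
            in.mark(readLimit);
            for (int i = 0; i < bytesToRead; i++) {
                in.read();
            }
            in.reset(); // throws IOException once the mark has been dropped
            return true;
        } catch (IOException e) {
            // The in-memory source cannot fail, so this can only come from reset().
            return false;
        }
    }

    public static void main(String[] args) {
        // Reads past both the 8-byte buffer and the read limit of 4.
        System.out.println("read 9 bytes, reset ok: " + resetSucceeds(8, 4, 9));
        // Stays within the buffer, so the mark is still valid.
        System.out.println("read 3 bytes, reset ok: " + resetSucceeds(8, 4, 3));
    }
}
```

The first call should report false and the second true. This is why approaches that hold the whole body in memory before parsing, such as the manual download above, avoid the error: nothing needs to be rewound on the network stream.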

Upvotes: 2
