Tony Stark

Reputation: 265

Scrape HTML from websites that reload the page after a few seconds

I want to scrape HTML from websites like http://www3.mangafreak.net/Manga/One_Piece using Jsoup and HtmlUnit. The problem with websites like this is that the first response is

Status Code:503 Service Temporarily Unavailable

and then, after a few seconds, the page reloads with

Status Code:200 OK

Upvotes: 0

Views: 132

Answers (1)

RBRi

Reputation: 2889

Try this (HtmlUnit only)

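    WebClient webClient = new WebClient();
    // Do not throw an exception on the initial 503 response
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);

    // First load: this is the page served with the 503 status
    HtmlPage page = (HtmlPage) webClient.getPage("http://www3.mangafreak.net/Manga/One_Piece");
    System.out.println(page.asXml());

    // Block until the JavaScript jobs scheduled to start within the next 5 seconds have finished
    WebWindow window = page.getEnclosingWindow();
    window.getJobManager().waitForJobsStartingBefore(5000);

    // The window now holds the reloaded page
    page = (HtmlPage) window.getEnclosedPage();
    System.out.println(page.asXml());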

Now you have the page, and you can use the HtmlUnit API to work with the DOM tree or to click on something.
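If the reload does not show up within the wait above, a more general fallback is to repeat the request until the status code is 200. The following is only a minimal sketch of that retry idea, not part of HtmlUnit: `RetryUntilOk`, `waitForOk` and `fetchStatus` are hypothetical names, and the fake responses in `main` stand in for a real request. With HtmlUnit, the supplier could presumably wrap a call that loads the page and reads `getWebResponse().getStatusCode()`.

```java
import java.util.function.IntSupplier;

class RetryUntilOk {

    /**
     * Calls fetchStatus until it returns 200 or maxAttempts is reached,
     * sleeping delayMillis between attempts. Returns the last status seen.
     * (Hypothetical helper for illustration only.)
     */
    public static int waitForOk(IntSupplier fetchStatus, int maxAttempts, long delayMillis) {
        int status = -1;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            status = fetchStatus.getAsInt();
            if (status == 200) {
                return status;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and stop retrying
                    Thread.currentThread().interrupt();
                    return status;
                }
            }
        }
        return status;
    }

    public static void main(String[] args) {
        // Stand-in for a real fetch: 503 twice, then 200
        int[] responses = {503, 503, 200};
        int[] calls = {0};
        int result = waitForOk(
                () -> responses[Math.min(calls[0]++, responses.length - 1)], 5, 10L);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

Keeping the fetch behind an `IntSupplier` means the retry loop can be tried without network access. The attempt limit and delay are arbitrary here and would need tuning for the real site.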

Upvotes: 1
