Mark

Reputation: 873

HtmlUnit gets page error

I am trying to parse this page.

http://www.reuters.com/article/2015/07/08/us-china-cybersecurity-idUSKCN0PI09020150708

My code looks like this:

  WebClient webClient = new WebClient(BrowserVersion.CHROME);
  final HtmlPage page = webClient.getPage("http://www.reuters.com/article/2015/07/08/us-alibaba-singapore-post-idUSKCN0PI03J20150708");
  System.out.println(page.asXml());

It gives me a lot of warnings and a huge stack trace, mostly related to the JavaScript engine. I have tried these options:

webClient.waitForBackgroundJavaScript(1000000);
webClient.setJavaScriptTimeout(1000000);

But nothing seems to work. This page executes JavaScript to load its content, so I need to wait for the page to finish loading before I can read it. Any ideas how I can resolve this issue?

Upvotes: 0

Views: 592

Answers (1)

Ahmed Ashour

Reputation: 5559

You need to wait just after getting the page. There is also a script error saying that "addImpression" is not defined; I don't know which JavaScript file defines it.

I suspect you are not using a recent version, because with a recent one there are not many warnings.

With the latest snapshot, I get the content by using:

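try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
    // Keep going when a page script fails (e.g. the undefined "addImpression")
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    final HtmlPage page = webClient.getPage("http://www.reuters.com/article/2015/07/08/us-alibaba-singapore-post-idUSKCN0PI03J20150708");
    // Wait AFTER getPage(): up to 10 seconds for background JavaScript to finish
    webClient.waitForBackgroundJavaScript(10000);
    // Print the rendered text rather than the raw markup
    System.out.println(page.asText());
}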
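
To illustrate why the wait has to come after getPage(), here is a toy model in plain Java. It is not HtmlUnit code: ToyClient, load(), waitForJobs() and content() are hypothetical names made up for this sketch, and the 200 ms delay just stands in for a page script. It assumes the wait in the question was issued before the page was fetched, which the question does not actually state. The point is that waitForBackgroundJavaScript is a one-time blocking call, not a setting, so calling it before any background job exists returns immediately.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model (NOT HtmlUnit): a "page" whose content is filled in by a background job. */
public class WaitAfterLoadDemo {

    static final class ToyClient implements AutoCloseable {
        private final ScheduledExecutorService jobs =
                Executors.newSingleThreadScheduledExecutor();
        private final AtomicInteger pending = new AtomicInteger();
        private volatile String content = "";

        // Returns immediately; the simulated "script" fills in the content later.
        void load() {
            pending.incrementAndGet();
            jobs.schedule(() -> {
                content = "article text";
                pending.decrementAndGet();
            }, 200, TimeUnit.MILLISECONDS);
        }

        // Blocks until no jobs are pending or the timeout elapses.
        // Returns the number of jobs still pending.
        int waitForJobs(long timeoutMillis) throws InterruptedException {
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
            while (pending.get() > 0 && System.nanoTime() < deadline) {
                Thread.sleep(10);
            }
            return pending.get();
        }

        String content() {
            return content;
        }

        @Override
        public void close() {
            jobs.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        try (ToyClient client = new ToyClient()) {
            client.waitForJobs(1000); // nothing scheduled yet, so this returns at once
            client.load();
            System.out.println("before wait: '" + client.content() + "'");
            client.waitForJobs(1000); // now there is a job to wait for
            System.out.println("after wait:  '" + client.content() + "'");
        }
    }
}
```

In this model the first print should show an empty string and the second the loaded text, which mirrors calling webClient.waitForBackgroundJavaScript(...) only once getPage(...) has returned.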

Upvotes: 3
