Java: HtmlUnit problem retrieving page title

Question

This is my first StackOverflow post, so I'll try to describe my problem as well as I can.

I want to create a program to retrieve the reviews from TripAdvisor pages. I tried to do it via the API, but they didn't respond when I requested an API key, so my alternative is to do it with a web crawler.

To do so I have a Spring project using HtmlUnit, a tool I have never used before. To test it, my first exercise is to retrieve the title of a web page, so I have implemented the following code:

@PostConstruct
public void init() throws Exception {
    TimeZone.setDefault(TimeZone.getTimeZone("Europe/Madrid"));

    getRequest.getPageName();

}

That calls the following method:

@Test
public void getPageName() throws Exception {
    try (final WebClient webClient = new WebClient()) {
        final HtmlPage page = webClient.getPage("https://www.tripadvisor.com");
        
        System.out.println(page.getTitleText());

    }
    catch (Exception e){
        System.out.println("ERROR " + e);
    }
}

When I run the code with https://www.google.com I get the response "Google" as expected, but when I try it with https://www.tripadvisor.com or https://www.youtube.com I get an error that I can't understand:

Caused by: net.sourceforge.htmlunit.corejs.javascript.EvaluatorException: syntax error (https://static.tacdn.com/assets/DDGchX.17d9b05f.js#1)

I did some quick research to see what the problem means and found a couple of posts about similar cases, but I can't work out the cause. Is it related to a JavaScript problem? Or a permissions problem?

If more information is required to analyze the problem, do not hesitate to ask for it. Thanks in advance for your time, and sorry if I have broken any of the StackOverflow rules/protocols.

RBRi · Accepted Answer

    try (final WebClient webClient = new WebClient()) {
        webClient.getOptions().setThrowExceptionOnScriptError(false);

        final HtmlPage page = webClient.getPage("https://www.tripadvisor.com");
        // final HtmlPage page = webClient.getPage("https://www.youtube.com");

        System.out.println("****************");
        System.out.println(page.getTitleText());
        System.out.println("****************");
    }
    catch (Exception e){
        System.out.println("ERROR " + e);
    }

At least with the recent version of HtmlUnit this produces

****************
Tripadvisor: Read Reviews, Compare Prices & Book
****************

What does setThrowExceptionOnScriptError do?

/**
 * Changes the behavior of this webclient when a script error occurs.
 * @param enabled indicates if exception should be thrown or not
 */

HtmlUnit uses Rhino (https://github.com/mozilla/rhino) as the base for its JavaScript support, and Rhino does not support all the language features available in JavaScript today (this is getting better with every version, see https://htmlunit.sourceforge.io/changes-report.html). Some pages use these features (e.g. to track you), and that is why you see the error. HtmlUnit was originally designed as a testing framework, which is why it stops at every error.

If you change that (see the option setting above) you still get the log output for every error, but the JavaScript processing continues (as it does in real browsers). You can also change the logging; see https://htmlunit.sourceforge.io/logging.html.
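
If the remaining log output is too noisy, one option is to turn the HtmlUnit loggers off. The following is only a minimal sketch under some assumptions: it assumes HtmlUnit's logging ends up in java.util.logging (which is the usual fallback when no other logging backend is on the classpath), and that the logger names match the package names `com.gargoylesoftware.htmlunit` (2.x) and `org.htmlunit` (3.x). The class name `QuietHtmlUnitLogging` is made up for this example. If your Spring project routes logging to another backend, the same logger names would have to be configured there instead.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietHtmlUnitLogging {

    // java.util.logging only holds loggers weakly, so keep strong references;
    // otherwise the configured level could be lost after garbage collection.
    // Assumption: logger names follow the HtmlUnit package names.
    private static final Logger HTMLUNIT_2X = Logger.getLogger("com.gargoylesoftware.htmlunit");
    private static final Logger HTMLUNIT_3X = Logger.getLogger("org.htmlunit");

    /** Switches off all log output below the HtmlUnit packages. */
    public static void silence() {
        HTMLUNIT_2X.setLevel(Level.OFF);
        HTMLUNIT_3X.setLevel(Level.OFF);
    }

    public static void main(String[] args) {
        silence();
        System.out.println("HtmlUnit 2.x logger level: " + HTMLUNIT_2X.getLevel());
        System.out.println("HtmlUnit 3.x logger level: " + HTMLUNIT_3X.getLevel());
    }
}
```

Call `QuietHtmlUnitLogging.silence()` once before creating the `WebClient`. This only hides the messages; script errors still happen, they are just no longer reported.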
