Reputation: 139
I am trying to crawl a web page that is built with GWT and uses the GWT-RPC mechanism for its AJAX calls. The page I am trying to crawl is not mine, so I can't change the server side. I am very new to GWT, and from my first couple of days with it my impression is that you can't deserialize the data unless you have the service interface (and its classes) available.
Am I right, or is there a way to crawl the data intelligently?
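For concreteness, by "the interface" I mean something like the following made-up GWT-RPC service definition; without these types (and the matching serialization policy) the wire payload can't be decoded through GWT's own mechanism:
import com.google.gwt.user.client.rpc.RemoteService;
import com.google.gwt.user.client.rpc.RemoteServiceRelativePath;

// Hypothetical service interface -- the names are invented for illustration.
@RemoteServiceRelativePath("items")
public interface ItemService extends RemoteService {
    java.util.List<String> fetchItems(String query);
}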
Upvotes: 1
Views: 1382
Reputation: 12412
You could do it using HtmlUnit and its WebClient:
// real code mixed with pseudo-code:
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

WebClient webClient = new WebClient(BrowserVersion.FIREFOX_3);
Map<String, String> urls = new HashMap<String, String>();
LinkedList<String> urlsToVisit = new LinkedList<String>();
urlsToVisit.add("http://some_gwt_app.com/#!home");
while (!urlsToVisit.isEmpty()) {
    String page = urlsToVisit.remove();
    if (urls.containsKey(page)) {
        continue; // already crawled
    }
    HtmlPage rendered = webClient.getPage(page); // executes the GWT JavaScript
    urls.put(page, rendered.asXml());            // keep the rendered DOM
    urlsToVisit.addAll(extractLinks(rendered));  // extractLinks is pseudo-code
}
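The extractLinks call is the pseudo-code part; one possible way to write it with HtmlUnit's own DOM API (the method name and signature are my own) is:
import java.util.ArrayList;
import java.util.List;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// Possible implementation of the extractLinks() pseudo-code above:
// collect the href of every anchor in the rendered page.
static List<String> extractLinks(HtmlPage page) {
    List<String> links = new ArrayList<String>();
    for (HtmlAnchor anchor : page.getAnchors()) {
        links.add(anchor.getHrefAttribute());
    }
    return links;
}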
You might have to experiment with the WebClient options a bit. In my case these seem to do a good job:
webClient.setThrowExceptionOnScriptError(false);
webClient.setRedirectEnabled(true);
webClient.setJavaScriptEnabled(true);
// important! Give the headless browser enough time to execute
// JavaScript. The exact time to wait may depend on your application.
webClient.waitForBackgroundJavaScript(20000);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
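If you are on a newer HtmlUnit release, most of these flags have moved to WebClientOptions, so the rough equivalent would be:
// Rough equivalent on newer HtmlUnit versions, where the flags live on
// WebClientOptions rather than on WebClient itself:
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setRedirectEnabled(true);
webClient.getOptions().setJavaScriptEnabled(true);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
webClient.waitForBackgroundJavaScript(20000);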
Upvotes: 1
Reputation: 329
I scrape for a living, and GWT is the one framework that almost always flummoxes me. The fact that it passes serialized, non-human-readable parameters prevents me from interjecting logic that will access the site.
On some simple GWT sites I've gotten scrapes to work by parsing the JavaScript and running portions of it as is (roughly as sketched below), but I can't get them all to work.
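When it does work, it is usually something along these lines (a rough sketch using Rhino; the fragment and names are made up, not code from any real site):
import org.mozilla.javascript.Context;
import org.mozilla.javascript.Scriptable;

// Rough sketch: evaluate an extracted fragment of the page's JavaScript
// with Rhino. "jsFragment" stands in for whatever portion was parsed out.
static Object runFragment(String jsFragment) {
    Context cx = Context.enter();
    try {
        Scriptable scope = cx.initStandardObjects();
        return cx.evaluateString(scope, jsFragment, "gwt-fragment", 1, null);
    } finally {
        Context.exit();
    }
}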
Upvotes: 1