Reputation: 139
I am trying to crawl a web page that is built with GWT and uses the GWT-RPC mechanism for its AJAX calls. The page I am trying to crawl is not mine, so I can't change the server side. I am very new to GWT, and from my first couple of days with it my impression is that you can't deserialize the data unless you have the service interface (and its classes) available.
Am I right, or is there a way to crawl the data intelligently?
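For concreteness, by "the interface" I mean something like the following made-up GWT-RPC service definition; without these types (and the matching serialization policy) the wire payload can't be decoded through GWT's own mechanism:
import com.google.gwt.user.client.rpc.RemoteService;
import com.google.gwt.user.client.rpc.RemoteServiceRelativePath;

// Hypothetical service interface -- the names are invented for illustration.
@RemoteServiceRelativePath("items")
public interface ItemService extends RemoteService {
    java.util.List<String> fetchItems(String query);
}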
Upvotes: 1
Views: 1382
Reputation: 12412
You could do it using HtmlUnit and its WebClient:
// real code mixed with pseudo-code:
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

WebClient webClient = new WebClient(BrowserVersion.FIREFOX_3);
Map<String, String> urls = new HashMap<String, String>();
LinkedList<String> urlsToVisit = new LinkedList<String>();
urlsToVisit.add("http://some_gwt_app.com/#!home");
while (!urlsToVisit.isEmpty()) {
    String page = urlsToVisit.remove();
    if (urls.containsKey(page)) {
        continue; // already crawled
    }
    HtmlPage rendered = webClient.getPage(page); // executes the GWT JavaScript
    urls.put(page, rendered.asXml());            // keep the rendered DOM
    urlsToVisit.addAll(extractLinks(rendered));  // extractLinks is pseudo-code
}
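The extractLinks call is the pseudo-code part; one possible way to write it with HtmlUnit's own DOM API (the method name and signature are my own) is:
import java.util.ArrayList;
import java.util.List;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// Possible implementation of the extractLinks() pseudo-code above:
// collect the href of every anchor in the rendered page.
static List<String> extractLinks(HtmlPage page) {
    List<String> links = new ArrayList<String>();
    for (HtmlAnchor anchor : page.getAnchors()) {
        links.add(anchor.getHrefAttribute());
    }
    return links;
}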
You might have to experiment with the WebClient options a bit. In my case these seem to do a good job:
webClient.setThrowExceptionOnScriptError(false);
webClient.setRedirectEnabled(true);
webClient.setJavaScriptEnabled(true);
// important! Give the headless browser enough time to execute
// JavaScript. The exact time to wait may depend on your application.
webClient.waitForBackgroundJavaScript(20000);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
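If you are on a newer HtmlUnit release, most of these flags have moved to WebClientOptions, so the rough equivalent would be:
// Rough equivalent on newer HtmlUnit versions, where the flags live on
// WebClientOptions rather than on WebClient itself:
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setRedirectEnabled(true);
webClient.getOptions().setJavaScriptEnabled(true);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
webClient.waitForBackgroundJavaScript(20000);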
Upvotes: 1
Reputation: 329
I scrape for a living, and GWT is the one framework that almost always flummoxes me. The fact that it passes serialized, non-human-readable parameters prevents me from interjecting logic that will access the site.
On some simple GWT sites I've gotten scrapes to work by parsing the JavaScript and running portions of it as is (roughly as sketched below), but I can't get them all to work.
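When it does work, it is usually something along these lines (a rough sketch using Rhino; the fragment and names are made up, not code from any real site):
import org.mozilla.javascript.Context;
import org.mozilla.javascript.Scriptable;

// Rough sketch: evaluate an extracted fragment of the page's JavaScript
// with Rhino. "jsFragment" stands in for whatever portion was parsed out.
static Object runFragment(String jsFragment) {
    Context cx = Context.enter();
    try {
        Scriptable scope = cx.initStandardObjects();
        return cx.evaluateString(scope, jsFragment, "gwt-fragment", 1, null);
    } finally {
        Context.exit();
    }
}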
Upvotes: 1