Reputation: 1782
I am trying to scrape some websites using HtmlUnit 2.16. The sites are fairly heavy, with around 5000 pages. After a number of pages have been scraped I get a Java heap space error. I have allocated -Xms1500m and -Xmx3000m, but after running for 30 to 45 minutes it throws an OutOfMemoryError. Here is my example:
try (WebClient webClient = new WebClient(BrowserVersion.FIREFOX_38)) {
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setUseInsecureSSL(true);
    webClient.getCookieManager().setCookiesEnabled(true);
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setPrintContentOnFailingStatusCode(false);
    webClient.setCssErrorHandler(new SilentCssErrorHandler());
    webClient.getOptions().setAjaxController(new NicelyResynchronizingAjaxController());
    // Get 1st page data (getPage needs a full URL including the protocol)
    HtmlPage currentPage = webClient.getPage("http://www.example.com");
    for (int i = 0; i < 5000; i++) {
        try {
            // Find the "Next" link and click it to load the following page
            HtmlElement next = (HtmlElement) currentPage
                    .getByXPath("//span[contains(text(),'Next')]")
                    .get(0);
            currentPage = next.click();
            webClient.waitForBackgroundJavascript(10000);
            System.out.println("Got data: " + currentPage.asXml());
        } catch (Exception e) {
            e.printStackTrace(System.err);
        }
    }
} catch (Exception e) {
    e.printStackTrace(System.err);
}
As you can see, I click the Next button to get each page's content. I also call webClient.close(). Has anyone faced a similar issue? Does HtmlUnit have a memory leak?
Upvotes: 5
Views: 1012
Reputation: 2879
Please try the latest version of HtmlUnit. We have fixed many memory issues in the meantime. Version 2.23, at least, has some fixes regarding the history. Additionally, you can now control the history size.
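To illustrate why a history size limit helps, here is a minimal sketch of a bounded history. It is a toy model, not HtmlUnit's implementation: the class names BoundedHistoryDemo and BoundedHistory are made up for this example, and pages are represented as plain strings. In HtmlUnit itself the limit is set through the WebClient options (I believe via setHistorySizeLimit, but please check the Javadoc of the version you use).

```java
import java.util.ArrayDeque;
import java.util.Deque;

class BoundedHistoryDemo {

    /** Hypothetical stand-in for a browsing history that keeps at most `limit` pages. */
    static class BoundedHistory {
        private final Deque<String> pages = new ArrayDeque<>();
        private final int limit;

        BoundedHistory(int limit) {
            if (limit < 0) {
                throw new IllegalArgumentException("limit must be >= 0");
            }
            this.limit = limit;
        }

        /** Records a page, then evicts the oldest entries until the limit is respected. */
        void add(String page) {
            pages.addLast(page);
            while (pages.size() > limit) {
                pages.removeFirst();
            }
        }

        int size() {
            return pages.size();
        }

        /** Returns the oldest retained page, or null if nothing is retained. */
        String oldest() {
            return pages.peekFirst();
        }
    }

    public static void main(String[] args) {
        BoundedHistory history = new BoundedHistory(3);
        // Simulate clicking "Next" 5000 times; only the last 3 pages stay reachable.
        for (int i = 0; i < 5000; i++) {
            history.add("page-" + i);
        }
        System.out.println(history.size() + " retained, oldest = " + history.oldest());
    }
}
```

Without the eviction loop, all 5000 entries would stay reachable and could never be garbage collected, which matches the slow heap growth described in the question. A limit of 0 retains nothing at all.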
Upvotes: 1
Reputation: 10151
Maybe the problem is that all the pages are still stored in the history.
I disable the browsing history this way:
try {
    final History window = webClient.getWebWindows().get(0).getHistory();
    // getDeclaredField may throw NoSuchFieldException
    final Field f = window.getClass().getDeclaredField("ignoreNewPages_");
    f.setAccessible(true);
    ((ThreadLocal<Boolean>) f.get(window)).set(Boolean.TRUE);
    LOGGER.debug("_dbff772d4d_ disabled history of Webclient");
}
catch (final Exception e) {
    LOGGER.warn("_66461112f7_ Can't disable history of Webclient");
}
I got the idea from the question "How to limit HtmlUnit's history size".
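If the reflection trick looks opaque, here is a self-contained sketch of the same technique against a stand-in class. FakeHistory, its recorded counter and the disableRecording helper are hypothetical names for this example only. The real field lives in HtmlUnit's History class and may be renamed or removed between versions, which is why the lookup failure is caught.

```java
import java.lang.reflect.Field;

class ReflectionToggleDemo {

    /** Hypothetical stand-in for a library class with a private per-thread flag. */
    static class FakeHistory {
        private final ThreadLocal<Boolean> ignoreNewPages_ =
                ThreadLocal.withInitial(() -> Boolean.FALSE);
        private int recorded = 0;

        /** Records a page unless the flag is set for the calling thread. */
        void addPage() {
            if (!ignoreNewPages_.get()) {
                recorded++;
            }
        }

        int recorded() {
            return recorded;
        }
    }

    /** Sets the private flag via reflection; returns false if that is not possible. */
    @SuppressWarnings("unchecked")
    static boolean disableRecording(Object history) {
        try {
            Field f = history.getClass().getDeclaredField("ignoreNewPages_");
            f.setAccessible(true);
            ((ThreadLocal<Boolean>) f.get(history)).set(Boolean.TRUE);
            return true;
        } catch (ReflectiveOperationException | RuntimeException e) {
            // Field missing, inaccessible, or of an unexpected type
            return false;
        }
    }

    public static void main(String[] args) {
        FakeHistory history = new FakeHistory();
        history.addPage();                       // recorded
        boolean ok = disableRecording(history);
        history.addPage();                       // ignored from now on
        System.out.println("disabled=" + ok + ", recorded=" + history.recorded());
    }
}
```

Because the flag is a ThreadLocal, setting it affects only the thread that calls set. If pages are loaded from several threads, each thread would have to set the flag itself.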
These settings are not related to your problem, but they were useful in my projects:
webClient.setJavaScriptTimeout(JAVASCRIPT_TIMOUT);
webClient.getOptions().setTimeout(WEB_TIMEOUT);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setPopupBlockerEnabled(true);
webClient.setRefreshHandler(new WaitingRefreshHandler(REFRESH_HANDLER_WAIT_LIMIT));
Upvotes: 2