Reputation: 9740
I am using Crawler4j simply to get the HTML from the crawled pages. For my test site of about 50 pages, it successfully stores the retrieved HTML. It uses the shouldVisit method I implemented, and it uses the visit method I implemented. Both run without any problems, and the files are also written without any problems. But after all the pages have been visited and stored, the crawler doesn't stop blocking:
System.out.println("Starting Crawl");
controller.start(ExperimentCrawler.class, numberOfCrawlers);
System.out.println("finished crawl");
The second println statement never executes. In my storage destination, the crawler has created a folder called 'frontier' that it holds a lock on (I can't delete it because the crawler is still using it).
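For reference, ExperimentCrawler is essentially the following sketch (the domain and the file-writing code are placeholders; the overrides match the crawler4j version I am on):
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class ExperimentCrawler extends WebCrawler {

    // Only follow links within the test site (placeholder domain).
    @Override
    public boolean shouldVisit(WebURL url) {
        return url.getURL().toLowerCase().startsWith("http://www.example.com/");
    }

    // Store the raw HTML of every visited page.
    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            String html = ((HtmlParseData) page.getParseData()).getHtml();
            // ... write html to a file (this part works fine) ...
        }
    }
}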
Here are the config settings I've given it (though it doesn't seem to matter what I set them to):
config.setCrawlStorageFolder("/data/crawl/root");
config.setMaxDepthOfCrawling(1);
config.setPolitenessDelay(1000);
config.setMaxPagesToFetch(50);
config.setConnectionTimeout(500);
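The controller itself is wired up in the standard way (a sketch; the seed URL is a placeholder):
PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
controller.addSeed("http://www.example.com/");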
There is an error that appears about one minute after the crawl finishes:
java.lang.NullPointerException
at com.sleepycat.je.Database.trace(Database.java:1816)
at com.sleepycat.je.Database.sync(Database.java:489)
at edu.uci.ics.crawler4j.frontier.WorkQueues.sync(WorkQueues.java:187)
at edu.uci.ics.crawler4j.frontier.Frontier.sync(Frontier.java:182)
at edu.uci.ics.crawler4j.frontier.Frontier.close(Frontier.java:192)
at edu.uci.ics.crawler4j.crawler.CrawlController$1.run(CrawlController.java:232)
at java.lang.Thread.run(Unknown Source)
What could be keeping the crawler from exiting? What is it writing to the 'frontier' folder?
Upvotes: 1
Views: 1018
Reputation: 789
You are using an old version of crawler4j.
The bug you are mentioning is very irritating, and it is actually a bug in the internal database crawler4j uses: BerkeleyDB.
Crawler4j uses the frontier directory internally, and you shouldn't worry about it or touch it, as it is for internal use only.
All of that being said, I have fixed that bug, and you should download the latest version of crawler4j, which contains my bugfixes (lots of them, including the one you mention).
So please go to our new site: https://github.com/yasserg/crawler4j
Follow the instructions there for installing it (via Maven, for example).
The external API has hardly changed (only very slightly).
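If I remember correctly, the most visible change is that shouldVisit now also receives the page the link was found on. Updating an override like this one (a sketch with a placeholder class name and domain) should be about all that is needed:
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {
    // In the old API this was: public boolean shouldVisit(WebURL url)
    // In the new API the referring page is passed in as well:
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        return url.getURL().toLowerCase().startsWith("http://www.example.com/");
    }
}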
Enjoy the new (currently v4.1) version.
Upvotes: 1