Reputation: 9740
I am using Crawler4j simply to get the HTML from the crawled pages. For my test site of about 50 pages, it successfully stores the retrieved HTML. It uses the shouldVisit method I implemented, and it uses the visit method I implemented. Both run without any problems, and the files are also written without any problems. But after all the pages have been visited and stored, the crawler doesn't stop blocking:
System.out.println("Starting Crawl");
controller.start(ExperimentCrawler.class, numberOfCrawlers);
System.out.println("finished crawl");
The second println statement never executes. In my storage destination, the crawler has created a folder called 'frontier' that it holds a lock on (I can't delete it because the crawler is still using it).
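For reference, ExperimentCrawler is essentially the following sketch (the domain and the file-writing code are placeholders; the overrides match the crawler4j version I am on):
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class ExperimentCrawler extends WebCrawler {

    // Only follow links within the test site (placeholder domain).
    @Override
    public boolean shouldVisit(WebURL url) {
        return url.getURL().toLowerCase().startsWith("http://www.example.com/");
    }

    // Store the raw HTML of every visited page.
    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            String html = ((HtmlParseData) page.getParseData()).getHtml();
            // ... write html to a file (this part works fine) ...
        }
    }
}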
Here are the config settings I've given it (though it doesn't seem to matter what I set them to):
config.setCrawlStorageFolder("/data/crawl/root");
config.setMaxDepthOfCrawling(1);
config.setPolitenessDelay(1000);
config.setMaxPagesToFetch(50);
config.setConnectionTimeout(500);
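The controller itself is wired up in the standard way (a sketch; the seed URL is a placeholder):
PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
controller.addSeed("http://www.example.com/");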
There is an error that appears about one minute after the crawl finishes:
java.lang.NullPointerException
at com.sleepycat.je.Database.trace(Database.java:1816)
at com.sleepycat.je.Database.sync(Database.java:489)
at edu.uci.ics.crawler4j.frontier.WorkQueues.sync(WorkQueues.java:187)
at edu.uci.ics.crawler4j.frontier.Frontier.sync(Frontier.java:182)
at edu.uci.ics.crawler4j.frontier.Frontier.close(Frontier.java:192)
at edu.uci.ics.crawler4j.crawler.CrawlController$1.run(CrawlController.java:232)
at java.lang.Thread.run(Unknown Source)
What could be keeping the crawler from exiting? What is it writing to the 'frontier' folder?
Upvotes: 1
Views: 1018
Reputation: 789
You are using an old version of crawler4j.
The bug you are mentioning is very irritating, and it is actually a bug in the internal database crawler4j uses: BerkeleyDB.
Crawler4j uses the frontier directory internally, and you shouldn't worry about it or touch it, as it is for internal use only.
All of that being said, I have fixed that bug, and you should download the latest version of crawler4j, which contains my bugfixes (lots of them, including the one you mention).
So please go to our new site: https://github.com/yasserg/crawler4j
Follow the instructions there for installing it (via Maven, for example).
The external API has hardly changed (only very slightly).
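If I remember correctly, the most visible change is that shouldVisit now also receives the page the link was found on. Updating an override like this one (a sketch with a placeholder class name and domain) should be about all that is needed:
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {
    // In the old API this was: public boolean shouldVisit(WebURL url)
    // In the new API the referring page is passed in as well:
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        return url.getURL().toLowerCase().startsWith("http://www.example.com/");
    }
}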
Enjoy the new (currently v4.1) version.
Upvotes: 1