Sudhir kumar

Reputation: 549

Improving Crawler4j crawler efficiency and scalability

I am using the Crawler4j crawler to crawl some domains. Now I want to improve the efficiency of the crawler: I want it to use my full bandwidth and crawl as many URLs as possible in a given time period. For that I am using the following settings:
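A minimal sketch of such a configuration, assuming the 50 ms politeness delay and crawl depth of 2 that the answer below refers to; the storage folder, page limit, seed URL, thread count, and the MyCrawler class are illustrative placeholders:

CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/tmp/crawler4j");  // illustrative storage folder
config.setPolitenessDelay(50);                   // 50 ms between requests to the same host
config.setMaxDepthOfCrawling(2);                 // crawl depth of 2
config.setMaxPagesToFetch(-1);                   // no limit on the number of pages (illustrative)

PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
controller.addSeed("http://www.example.com/");   // placeholder seed

// A comparatively high number of crawler threads to use more bandwidth (illustrative value)
controller.start(MyCrawler.class, 20);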

Now what I want to know is:

1) Are there any side effects with these kinds of settings?

2) Is there anything else I have to do, apart from this, to improve my crawler's speed?

3) Can someone tell me the maximum limits of every setting (e.g. the maximum number of threads supported by crawler4j at a time)? I have already gone through the code of Crawler4j, but I did not find any limits anywhere.

4) How can I crawl a domain without checking its robots.txt file? As I understand it, crawler4j first checks a domain's robots.txt file before crawling, and I don't want that.

5) How does the page fetcher work? (Please explain it briefly.)

Any help is appreciated, and please go easy on me if the question is stupid.

Upvotes: 1

Views: 2167

Answers (1)

Tobias K.

Reputation: 85

I'll try my best to help you here. I can't guarantee correctness or completeness.

  1. b) Reducing the politeness delay will create more load on the site being crawled and can (on small servers) increase the response time in the long term. But this is not a common problem nowadays, so 50 ms should still be fine. Also note that if it takes 250 ms to receive the response from the webserver, it will still take 250 ms before this thread can crawl the next page.

    c) I am not quite sure what you want to achieve by setting the crawl depth to a value of two. For example, a crawl depth of 1 means you crawl the seed, then crawl every page found on the seed, and then stop (crawlDepth = 2 just goes one step further, and so on). This will not influence your crawl speed, just your crawl time and the number of pages found.

  2. Do not implement time-heavy actions within the crawler thread or in any of the methods/classes it calls. Do them at the end of the crawl or in an extra thread (see the sketch below).
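
A minimal sketch of what "in an extra thread" could look like inside a WebCrawler subclass; the MyCrawler class name and the processPage() helper are hypothetical, only visit(Page) is crawler4j's actual callback:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;

public class MyCrawler extends WebCrawler {

    // One shared pool for heavy post-processing, so the crawler threads are not blocked by it
    private static final ExecutorService PROCESSING_POOL = Executors.newFixedThreadPool(4);

    @Override
    public void visit(Page page) {
        // Keep this method cheap: extract what you need and hand it to the pool
        String url = page.getWebURL().getURL();
        PROCESSING_POOL.submit(() -> processPage(url, page));
    }

    // Hypothetical helper that does the expensive work (parsing, storing to a database, etc.)
    private void processPage(String url, Page page) {
        // ... time-consuming work goes here ...
    }
}

In a real setup you would also shut the pool down once the crawl has finished, e.g. after controller.start(...) returns.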

  3. There are no limits imposed by the crawler configuration itself. Limits will be set by your CPU (not likely) or by the structure of the site being crawled (very likely).

  4. Add this line to your CrawlController: robotstxtConfig.setEnabled(false);

It should look like this now:

PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
robotstxtConfig.setEnabled(false); // the crawler will no longer fetch or obey robots.txt
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
  5. The page fetcher sets some parameters and then sends an HTTP GET request with those parameters to the webserver for the given URL. The response from the webserver is evaluated, and some information, such as the response headers and the HTML code in binary form, is saved.
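
Conceptually (this is not crawler4j's actual code, just a plain-Java illustration of the steps described above), a single fetch amounts to sending an HTTP GET with some preset parameters and saving the status, headers, and raw body bytes; the URL and timeout values are placeholders:

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.Map;

public class SimpleFetchSketch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.example.com/");        // placeholder URL
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(30000);                        // the "parameters" set before fetching
        conn.setReadTimeout(30000);
        conn.setRequestProperty("User-Agent", "crawler-sketch");

        int status = conn.getResponseCode();                  // evaluate the response
        Map<String, List<String>> headers = conn.getHeaderFields();

        // Read the HTML in binary form
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        try (InputStream in = conn.getInputStream()) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                body.write(buffer, 0, read);
            }
        }

        System.out.println(status + ", " + headers.size() + " headers, " + body.size() + " bytes");
    }
}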

Hope I could help you a bit.

Upvotes: 3
