Reputation: 11
I'm using RCrawler to crawl ~300 websites. The sites vary widely in size: some are small (a dozen or so pages) and others are large (thousands of pages per domain). Crawling the large ones is very time-consuming, and for my research purposes the added value of additional pages decreases once I already have a few hundred.
So: is there a way to stop the crawl once a certain number of pages has been collected?
I know I can limit the crawl with MaxDepth, but even at MaxDepth = 2 this is still an issue, and MaxDepth = 1 is not desirable for my research. Also, I'd prefer to keep MaxDepth high so that the smaller websites do get crawled completely.
Thanks a lot!
Upvotes: 1
Views: 169
Reputation: 11
How about implementing a custom function for the FUNPageFilter parameter of the Rcrawler function? The custom function would check the number of files already saved in DIR and return FALSE once there are too many.
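For example, a minimal sketch along those lines (max_pages, out_dir, and page_limit_filter are illustrative names of my own; Rcrawler writes each site into its own subfolder of DIR, so the count here is recursive and assumes a fresh DIR per crawl):

```r
library(Rcrawler)

# Illustrative values; adjust to your own setup.
max_pages <- 500          # stop collecting once this many pages are saved
out_dir   <- "./crawled"  # same path passed to DIR below; Rcrawler creates a
                          # per-site subfolder inside it, hence the recursive count

# FUNPageFilter is called for each fetched page; this sketch ignores the page
# object and returns FALSE once out_dir already holds max_pages saved files,
# so no further pages are collected for this crawl.
page_limit_filter <- function(page) {
  length(list.files(out_dir, recursive = TRUE)) < max_pages
}

Rcrawler(Website       = "https://www.example.com",
         DIR           = out_dir,
         MaxDepth      = 10,
         FUNPageFilter = page_limit_filter)
```

One caveat I haven't verified: returning FALSE may only stop pages from being stored rather than stopping the crawler from following links, so very large sites could still take a while to finish, but the number of collected pages stays bounded.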
Upvotes: 0