mayayaya

Reputation: 11

RCrawler: way to limit number of pages that RCrawler collects? (not crawl depth)

I'm using RCrawler to crawl ~300 websites. The sizes of the websites vary widely: some are small (a dozen or so pages) and others are large (thousands of pages per domain). Crawling the latter is very time-consuming, and, for my research purposes, the added value of more pages decreases once I already have a few hundred.

So: is there a way to stop the crawl once x number of pages has been collected?

I know I can limit the crawl with MaxDepth, but even at MaxDepth = 2 this is still an issue, and MaxDepth = 1 is not desirable for my research. Also, I'd prefer to keep MaxDepth high so that the smaller websites do get crawled completely.

Thanks a lot!

Upvotes: 1

Views: 169

Answers (1)

Dan T.

Reputation: 11

How about implementing a custom function for the FUNPageFilter parameter of the Rcrawler function? The custom function checks the number of files in DIR and returns FALSE if there are too many files.
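A minimal sketch of that idea, assuming FUNPageFilter receives the fetched page as a single argument (which the filter can simply ignore) and that crawl_dir and max_pages are placeholders you set yourself:

    library(Rcrawler)

    # Hypothetical settings: adjust to your own crawl
    max_pages <- 300                 # stop collecting after roughly this many pages
    crawl_dir <- "C:/crawl_output"   # must match the DIR passed to Rcrawler

    # FUNPageFilter is called for every fetched page; returning FALSE tells
    # Rcrawler not to collect/index that page. Here the page itself is ignored
    # and the decision is based only on how many files have been saved so far.
    page_limit_filter <- function(page) {
      saved <- length(list.files(crawl_dir, recursive = TRUE))
      saved < max_pages
    }

    Rcrawler(Website = "https://www.example.com",
             DIR = crawl_dir,
             MaxDepth = 5,
             FUNPageFilter = page_limit_filter)

Note that this filters which pages get collected rather than halting the crawler outright, so very large sites may still take some time to traverse even after the limit is reached.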

Upvotes: 0
