Ciaran

Reputation: 521

Ensure that Nutch has crawled all pages of a particular domain

I am using Nutch to collect all the data from a single domain. How can I ensure that Nutch has crawled every page under a given domain?

Upvotes: 0

Views: 81

Answers (1)

Jorge Luis

Reputation: 3253

This is not technically possible, since there is no limit on the number of different pages that can exist under the same domain; this is especially true for dynamically generated websites. What you could do is look for a sitemap.xml and ensure that all of the URLs it lists are crawled/indexed by Nutch. Since the sitemap is the site's own declaration of which URLs matter, you can use it as a guide for what needs to be crawled.
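As a rough illustration (not from the original answer), here is a minimal Python sketch of that check: it pulls the `<loc>` entries out of a sitemap.xml and compares them against a plain-text dump of the crawldb produced with `bin/nutch readdb <crawldb> -dump <outdir>`. The sitemap URL and dump directory are placeholders, and the parsing assumes the usual dump layout where each record line begins with the URL.

```python
import glob
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://example.com/sitemap.xml"   # placeholder domain
CRAWLDB_DUMP = "crawldb_dump"                     # e.g. output of: bin/nutch readdb crawl/crawldb -dump crawldb_dump

def sitemap_urls(url):
    """Collect <loc> entries from a sitemap (namespace-agnostic)."""
    with urllib.request.urlopen(url) as resp:
        tree = ET.parse(resp)
    return {el.text.strip() for el in tree.iter() if el.tag.endswith("loc") and el.text}

def crawled_urls(dump_dir):
    """Collect URLs from a readdb text dump; each record starts with the URL."""
    urls = set()
    for path in glob.glob(f"{dump_dir}/part-*"):
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                if line.startswith(("http://", "https://")):
                    urls.add(line.split("\t", 1)[0].strip())
    return urls

missing = sitemap_urls(SITEMAP_URL) - crawled_urls(CRAWLDB_DUMP)
print(f"{len(missing)} sitemap URLs not yet in the crawldb")
for url in sorted(missing):
    print(url)
```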

Nutch has a sitemap processor that will inject all the URLs from the sitemap into the current crawldb (i.e., it will "schedule" those URLs to be crawled).
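The processor is normally run from the command line; a hedged sketch of invoking it from a script is shown below. The paths are hypothetical, and the exact option names can vary between Nutch versions, so confirm them against the usage string printed by `bin/nutch sitemap`.

```python
import subprocess

# Hypothetical paths; adjust to your crawl layout. The -sitemapUrls option
# points at a directory of seed files listing sitemap URLs; verify the flag
# names against the usage output of `bin/nutch sitemap` for your version.
subprocess.run(
    ["bin/nutch", "sitemap", "crawl/crawldb", "-sitemapUrls", "sitemap_seeds/"],
    check=True,
)
```

After the sitemap URLs have been injected, the usual generate/fetch/updatedb cycle picks them up, and the coverage check above can be re-run against a fresh crawldb dump.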

As a hint, even Google enforces a maximum number of URLs to be indexed from the same domain when doing a deep crawl. This is usually referred to as a crawl budget.

Upvotes: 2
