Reputation: 47
I am trying to crawl some URLs with Apache Nutch 1.11.
There are 7 URLs in my seeds.txt
file, and I run the command:
bin/crawl -i urls crawl 22
My problem is that with depth 22, I expect it to fetch quite a few pages. But today all it does is fetch the exact same URLs that are in my seeds.txt
file and nothing more. And as weird as it sounds, yesterday the exact same files and properties ended up fetching 313 URLs. I haven't changed anything since then. Does anyone know what's going on?
The only thing that has changed is that yesterday I was using another computer. But since I am running the crawl command on a remote machine, I don't think that has anything to do with it. Does it?
Upvotes: 0
Views: 558
Reputation: 4854
Generate a dump of the crawldb with the readdb command and check the nextFetchDate of the seed URLs, or start a fresh crawl with a new crawldb and segments dir to see what happens.
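For example, assuming the crawl dir from your command and an output dir name of your choosing (crawldb-dump and the seed URL below are just placeholders), the checks would look something like this:
bin/nutch readdb crawl/crawldb -stats
bin/nutch readdb crawl/crawldb -dump crawldb-dump
bin/nutch readdb crawl/crawldb -url http://one-of-your-seeds.example.com/
The per-URL output prints the status and fetch time stored for that entry; the fetch time is effectively the nextFetchDate, and if it is still in the future the generator will skip that URL until then.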
Do the logs reveal anything interesting? Are the seed URLs actually fetched, and if so, how do you know they are?
Is the content of the seeds likely to have yielded different URLs from the previous day?
fetcher.max.crawl.delay is not related to scheduling; it controls how the fetcher behaves when a robots.txt sets a Crawl-Delay so large that it is impractical to honour.
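For reference, that property looks roughly like this in nutch-default.xml (the description is paraphrased from memory, so check your own copy; the default of 30 seconds may vary between versions):
<property>
  <name>fetcher.max.crawl.delay</name>
  <value>30</value>
  <description>If the Crawl-Delay in robots.txt is greater than this value (in seconds),
  the fetcher skips the page; a value of -1 makes the fetcher wait however long
  robots.txt asks, no matter how large the delay.</description>
</property>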
The config you are after is:
<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
  <description>The default number of seconds between re-fetches of a page (30 days).
  </description>
</property>
i.e. by default a page is not re-fetched until a month later. Again, a crawldb dump will give you all the details about what happened to your URLs.
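If the fetch interval does turn out to be the culprit and you want the seeds to become eligible sooner, one option is to override the property in conf/nutch-site.xml (the one-day value below is only an example):
<property>
  <name>db.fetch.interval.default</name>
  <value>86400</value>
  <description>Re-fetch pages after one day (override for testing).</description>
</property>
nutch-site.xml overrides nutch-default.xml. Keep in mind that, as far as I remember, the interval is stored per URL in the crawldb when the URL is injected or updated, so existing entries keep their old schedule; a fresh crawldb, as suggested above, is the quickest way to see the change take effect.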
Upvotes: 1