S.Doe

Reputation: 47

Nutch fetches just the URLs that exist in seeds file

I am trying to crawl some URLs with Apache Nutch 1.11. There are 7 URLs in my seeds.txt file, and I run the command:

bin/crawl -i urls crawl 22

My problem is that with depth 22, I expect it to fetch quite a few pages. But today, all it does is fetch the exact same URLs that are in my seeds.txt file and nothing more. And as weird as it sounds, yesterday the exact same files and properties ended up fetching 313 URLs. I haven't changed anything since yesterday. Does anyone know what's going on?

The only thing that has changed is that yesterday I was using another computer. But since I run the crawl command on a remote machine, I don't think that has anything to do with it. Does it?

Upvotes: 0

Views: 558

Answers (1)

Julien Nioche

Reputation: 4854

Generate a crawl dump with the readdb command and check the nextFetchDate for the seeds, or try a fresh crawl with a new crawldb and segments dir to see what happens.
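
For example, a rough sketch, assuming the crawl directory from the command in the question so the crawldb lives under crawl/crawldb (the seed URL below is just a placeholder):

bin/nutch readdb crawl/crawldb -stats
bin/nutch readdb crawl/crawldb -dump crawldb-dump
grep -A 3 "http://one-of-your-seeds/" crawldb-dump/part-*

or, for a single URL:

bin/nutch readdb crawl/crawldb -url http://one-of-your-seeds/

The entries show the status, fetch time (i.e. the nextFetchDate) and fetch interval for each URL; if a URL's fetch time is still in the future, the generator will not select it for fetching in the next round.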

Do the logs reveal anything interesting? Are the seed URLs actually fetched and if so how do you know they are?
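
One quick check, assuming the default log location under the Nutch runtime directory, is to look for the fetcher's per-URL log lines:

grep "fetching" logs/hadoop.log

Each URL the fetcher actually retrieved should show up as a "fetching ..." line there.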

Is the content of the seeds likely to have yielded different URLs from the previous day?

fetcher.max.crawl.delay is not related to the scheduling; it controls how the fetcher behaves when a robots.txt file sets a crawl delay so large that it is impractical.

The config you are after is

<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
  <description>The default number of seconds between re-fetches of a page (30 days).
  </description>
</property>

i.e. refetch a month later. Again, a crawldb dump will give you all the details about what happened to your URLs.
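
If the nextFetchDate does turn out to be the culprit, the simplest check (just a sketch; crawl_fresh is an arbitrary new directory name) is to rerun against an empty crawldb so every seed is due for fetching again:

bin/crawl -i urls crawl_fresh 22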

Upvotes: 1
