Xiao

Reputation: 555

Using Nutch to crawl a specified URL list

I have a list of one million URLs to fetch. I use this list as the Nutch seeds and run the basic Nutch crawl command to fetch them. However, I find that Nutch also fetches URLs that are not on the list. I set the crawl parameters to -depth 1 -topN 1000000, but that does not help. Does anyone know how to restrict the crawl to the listed URLs?

Upvotes: 1

Views: 3818

Answers (2)

Tejas Patil

Reputation: 6169

Set this property in nutch-site.xml. (By default it is true, so updatedb adds outlinks to the CrawlDb.)

<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
  <description>If true, updatedb will add newly discovered URLs, if false
  only already existing URLs in the CrawlDb will be updated and no new
  URLs will be added.
  </description>
</property>
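
To confirm that the override is actually present in the config file Nutch reads, a quick check along these lines may help. This is only a sketch: the helper name `additions_allowed` is made up for illustration, and the grep/sed parsing assumes `<value>` sits on the line directly after `<name>`, as in the snippet above.

```shell
# Hypothetical helper: print the value set for
# db.update.additions.allowed in the given Nutch config file.
# Assumes the <value> line immediately follows the <name> line.
additions_allowed() {
  grep -A1 '<name>db.update.additions.allowed</name>' "$1" \
    | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}
```

Calling `additions_allowed conf/nutch-site.xml` (path assumed) prints the configured value; if it prints nothing, the property is not set in that file and the default from nutch-default.xml applies.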

Upvotes: 4

Debaditya

Reputation: 2497

  • Delete the crawl and urls directories (if they were created earlier)
  • Create or update the seed file (the URLs are listed one per line)
  • Restart the crawling process
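
The seed-file step could be sketched roughly as follows. The function name `make_seed_dir`, the raw input file, and the file name `seed.txt` are assumptions for illustration; only the one-URL-per-line layout comes from the steps above.

```shell
# Hypothetical helper: rebuild a seed directory from a raw URL list.
#   $1 = raw list (may contain blank lines and duplicates)
#   $2 = seed directory to (re)create
make_seed_dir() {
  raw_list="$1"
  seed_dir="$2"

  # Start from a clean seed directory so stale seed files are not injected.
  rm -rf -- "$seed_dir"
  mkdir -p -- "$seed_dir"

  # One URL per line: drop blank lines, then sort and remove duplicates.
  grep -v '^[[:space:]]*$' "$raw_list" | sort -u > "$seed_dir/seed.txt"
}
```

For example, `make_seed_dir all_urls.txt urllist` (with `all_urls.txt` as an assumed input file) leaves a deduplicated `urllist/seed.txt` that can be passed to the crawl command as the seed directory.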

Command

nutch crawl urllist -dir crawl -depth 3 -topN 1000000
  • urllist - directory containing the seed file (URL list)
  • crawl - directory where the crawl data is stored

If the problem persists, try deleting your Nutch folder and repeating the whole process.

Upvotes: 2
