Reputation: 555
I have a list of one million URLs to fetch. I use this list as the Nutch seeds and run the basic Nutch crawl command to fetch them. However, I find that Nutch also fetches URLs that are not on the list, even though I set the crawl parameters to -depth 1 -topN 1000000. Does anyone know how to restrict the crawl to the seed list only?
Upvotes: 1
Views: 3818
Reputation: 6169
Set this property in nutch-site.xml. By default it is true, so updatedb adds newly discovered outlinks to the CrawlDb.
<property>
<name>db.update.additions.allowed</name>
<value>false</value>
<description>If true, updatedb will add newly discovered URLs, if false
only already existing URLs in the CrawlDb will be updated and no new
URLs will be added.
</description>
</property>
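For context, here is a minimal sketch of where that property sits in a complete nutch-site.xml, assuming the standard Hadoop-style configuration layout that Nutch uses. Any properties you already have in the file stay alongside it inside the same configuration element; the comment text is mine, not from the Nutch distribution.

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- Overrides the default (true) from nutch-default.xml so that
       updatedb only updates URLs already in the CrawlDb. -->
  <property>
    <name>db.update.additions.allowed</name>
    <value>false</value>
  </property>
</configuration>
```

With this in place, outlinks found during parsing should no longer be added to the CrawlDb, so later rounds only see the injected seed URLs.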
Upvotes: 4
Reputation: 2497
Command
nutch crawl urllist -dir crawl -depth 3 -topN 1000000
If the problem still persists, try deleting your Nutch folder and restarting the whole process.
Upvotes: 2