Reputation: 739
I am using Nutch 2.3 and I am trying to fetch the HTML content of the URLs listed in the seed.txt file that I pass to Nutch, storing it in HBase.
The problem is as follows:
First crawl: everything runs fine and I get the data into HBase, with the URL as the row key.
Second crawl: when I run the crawl a second time with a different URL, I see the fetch job running on many URLs, even though my seed file contains only one.
So my question is: how can I make sure that Nutch crawls and fetches the HTML content of only the URLs present in seed.txt, and not the outlinks found in those pages?
Upvotes: 0
Views: 680
Reputation: 8678
I think you want to fetch only the domains given in the seed file. For that, update nutch-site.xml as follows:
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
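This drops outlinks pointing to hosts outside the seed domains, so later rounds stay within those domains. If you want to fetch only the exact URLs in seed.txt and skip outlinks entirely, the companion property below may help; this is only a sketch, so verify the property name against conf/nutch-default.xml for your Nutch version.
<!-- assumption: together with db.ignore.external.links, this also drops
     same-host outlinks, leaving only the injected seed URLs to fetch -->
<property>
  <name>db.ignore.internal.links</name>
  <value>true</value>
</property>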
Upvotes: 1
Reputation: 192
You can keep the number of iterations of the crawl command at "1"; then Nutch will crawl only the URLs present in the seed.txt file.
e.g.
bin/crawl -i -D solr.server.url=<solrUrl> <seed-dir> <crawl-dir> 1
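With a single round, Nutch only generates and fetches the URLs injected from the seed directory; outlinks are still discovered at parse time, but no second generate/fetch cycle runs, so they are never fetched.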
Also, you can restrict the outlinks by editing regex-urlfilter.txt in the conf directory, replacing the default "accept anything else" rule (+.) at the end of the file with a rule for your domain:
# accept only URLs from this domain
+^http://([a-z0-9]*\.)*domain\.com/
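As a fuller sketch (domain.com is a placeholder), a minimal regex-urlfilter.txt restricted to one domain could look like this. Rules are tried in order and the first match wins, so the closing -. rule rejects anything that no + rule accepted:
# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):
# accept only URLs under the seed domain
+^https?://([a-z0-9-]*\.)*domain\.com/
# reject anything else
-.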
Upvotes: 0