sachingupta

Reputation: 739

How to set Nutch to extract content of only the URLs present in the seed file

I am using Nutch 2.3 and I am trying to get the HTML content of some URLs in the seed.txt file that I pass to Nutch, and store it in HBase.

The problem is as follows:

First crawl: Everything runs fine and I get the data into HBase with the URL as the row key.

Second run: When I run the crawl a second time with different URLs, I see the fetch job running for many URLs, even though I have only one URL in my seed file.

So my question is: how can I make sure that Nutch only crawls and fetches the HTML content of the URLs present in seed.txt, and not the outlinks found in the HTML content of those pages?

Upvotes: 0

Views: 680

Answers (2)

Hafiz Muhammad Shafiq

Reputation: 8678

I think you want to fetch only the domains given in the seed file. For that, update nutch-site.xml as follows:

  <property>
   <name>db.ignore.external.links</name>
   <value>true</value>
  </property>
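For context, this property goes inside the root `<configuration>` element of conf/nutch-site.xml; a minimal sketch of the whole file (the description text is paraphrased, not quoted from nutch-default.xml):

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <description>If true, outlinks pointing to a different host are ignored,
    so the crawl stays on the domains of the seed URLs.</description>
  </property>
</configuration>
```

Note that this keeps the crawl on the seed *domains*; it does not by itself limit the crawl to the exact seed URLs.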

Upvotes: 1

Rocky Mena

Reputation: 192

You can keep the iteration count of the crawl command at "1"; then Nutch will crawl only the URLs present in the seed.txt file.

e.g.

bin/crawl -i -D solr.server.url=<solrUrl> <seed-dir> <crawl-dir> 1

Also, you can restrict the outlinks by configuring regex-urlfilter.txt in the conf directory, e.g.:

# accept only URLs on the seed domain (domain.com is a placeholder)
+^http://domain\.com
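A slightly fuller sketch of that filter, assuming the seeds live on a single placeholder domain (domain.com). Filter rules are applied top to bottom and the first match wins, so the stock catch-all `+.` line at the end of the file should be removed or replaced:

```
# regex-urlfilter.txt sketch -- domain.com is a placeholder
# accept the seed domain and its subdomains
+^https?://([a-z0-9-]+\.)*domain\.com/
# reject everything else (replaces the stock "+." catch-all)
-.
```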

Upvotes: 0

Related Questions