Reputation: 739
I am using Nutch 2.x. I am trying to use the nutch command with a depth option:
$: nutch inject ./urls/seed.txt -depth 5
After executing this command I get the message
Unrecognized arg -depth
Since that failed, I tried nutch crawl instead:
$: nutch crawl ./urls/seed.txt -depth 5
and got the error
Command crawl is deprecated, please use bin/crawl instead
So I tried the bin/crawl script to crawl the URLs in seed.txt with a depth option, but it asks for a Solr URL, and I am not using Solr.
My question is: how do I crawl a website while specifying a depth?
Upvotes: 3
Views: 2841
Reputation: 727
My question is: what do you want to do by crawling the pages without indexing them in Solr?
Answer to your question:
If you want to use the Nutch crawler and you don't want to index into Solr, then remove the Solr-related piece of code from the crawl script (see the sketch after the link below).
http://technical-fundas.blogspot.com/2014/07/crawl-your-website-using-nutch-crawler.html
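For reference, the Solr-related part of the Nutch 2.x crawl script looks roughly like the lines below ($SOLRURL, $BATCH and __bin_nutch are the names used in the stock script, but details may differ between versions, so treat this as a sketch of what to comment out rather than an exact copy):

  # Index the fetched batch into Solr -- remove or comment out if you do not use Solr
  echo "Indexing $BATCH on SOLR index -> $SOLRURL"
  __bin_nutch solrindex $commonOptions $SOLRURL -all -crawlId "$CRAWL_ID"

  # Deduplicate the Solr index -- also Solr-only, can go as well
  echo "SOLR dedup -> $SOLRURL"
  __bin_nutch solrdedup $commonOptions $SOLRURL

Once those steps are gone, the number-of-rounds argument of bin/crawl plays the role of -depth (e.g. 5 rounds for depth 5), though the exact argument list depends on how you edit the script.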
Answer to your other question:
How to get the HTML content for all the links that have been crawled by Nutch (check this link, and the quick sketch below it):
How to get the html content from nutch
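As a quick sketch on the Nutch 2.x side (readdb is backed by WebTableReader in 2.x; myCrawl is just a placeholder crawl ID), the stored page content can be dumped from the web table with:

$: nutch readdb -dump dump_dir -content -crawlId myCrawl

The -content flag includes the raw HTML in the dump; -text would give the parsed text instead.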
This will definitely resolve your issue.
Upvotes: 1