Reputation: 1
I am using Solr 5.0 and Nutch 1.10 with Cygwin on Windows Server 2008 R2. I am issuing the command as:
bin/crawl -D urls/ bin/urls crawl/ 2
To my knowledge, 2 is the number of rounds for the crawl. When I execute this command and read the crawldb (see the readdb sketch below), I get only 127 URLs, which is far fewer than expected. It also does not crawl to a deeper depth. When I then issue this command to pass the data to Solr:
bin/nutch solrindex http://127.0.0.1:8983/solr/thetest crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
and then perform a search, I get only 20 URLs in total. Can anyone help? I need to crawl to a deeper depth.
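For reference, one way to count the URLs in the crawldb is Nutch's readdb tool; a minimal sketch, assuming the crawl directory from the command above:

# Print crawldb statistics (total URLs, fetched, unfetched, ...)
bin/nutch readdb crawl/crawldb -stats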
Upvotes: 0
Views: 428
Reputation: 192
You may increase the number of rounds incrementally, which will fetch more URLs. You can see the number of URLs fetched in each round in the hadoop.log file in the ./logs folder.
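A rough way to count those fetches is to grep the Fetcher's per-URL lines; a minimal sketch, assuming the default log format and the ./logs location mentioned above:

# The Fetcher writes one "fetching <url>" line per URL it attempts
grep -c "fetching" logs/hadoop.log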
You may refer to this link:
Usage: crawl [-i|--index] [-D "key=value"] <Seed Dir> <Crawl Dir> <Num Rounds>
    -i|--index    Indexes crawl results into a configured indexer
    -D            A Java property to pass to Nutch calls
    Seed Dir      Directory in which to look for a seeds file
    Crawl Dir     Directory where the crawl/link/segments dirs are saved
    Num Rounds    The number of rounds to run this crawl for
Example: bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ urls/ TestCrawl/ 2
bin/crawl -i -D solr.server.url=$solrUrl cores/$coreName/urls cores/$coreName/crawl 2
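Adapted to the setup in the question (the thetest core and the urls/ seed directory), a sketch that also raises the round count; 5 rounds is an arbitrary starting point to increase from, not a documented recommendation:

# Crawl 5 rounds and index each round's results into the thetest core
bin/crawl -i -D solr.server.url=http://127.0.0.1:8983/solr/thetest urls/ crawl/ 5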
Upvotes: 0