user1773304

Reputation: 31

Nutch didn't crawl all URLs from the seed.txt

I am new to Nutch and Solr. I would like to crawl a website whose content is generated by ASP. Since the content is not static, I created a seed.txt that contains all the URLs I would like to crawl. For example:

http://us.abc.com/product/10001
http://us.abc.com/product/10002
http://jp.abc.com/product/10001
http://jp.abc.com/product/10002
...

The regex-urlfilter.txt has this filter:

# accept anything else
#+.
+^http://([a-z0-9]*\.)*abc.com/

I used this command to start the crawling:

/bin/nutch crawl urls -solr http://abc.com:8983/solr/ -dir crawl -depth 10 -topN 10

The seed.txt contains 40,000+ URLs. However, I found that the content of many of these URLs cannot be found in Solr.

Question:

  1. Is this approach workable for such a large seed.txt?

  2. How can I check whether a URL has been crawled?

  3. Does seed.txt have a size limitation?

Thank you!

Upvotes: 2

Views: 2905

Answers (2)

DavSeq

Reputation: 3

topN indicates how many of the generated links should be fetched. You could have 100 generated links, but if you set topN to 12, only 12 of those links will be fetched, parsed, and indexed.
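Applied to the command in the question, a quick sketch of that fix (the -topN value of 50000 is only an illustrative number larger than the 40,000+ seeds; everything else is copied from the question) would be:

/bin/nutch crawl urls -solr http://abc.com:8983/solr/ -dir crawl -depth 10 -topN 50000

With -topN 10, at most 10 URLs are selected in each generate/fetch round, so only a small fraction of the seeds could ever be fetched.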

Upvotes: 0

Jayendra

Reputation: 52809

Check out the property db.max.outlinks.per.page in the Nutch configuration files.
The default value for this property is 100, so only 100 URLs will be picked up from seed.txt and the rest will be skipped.
Change this value to a higher number to have all the URLs scanned and indexed.
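A minimal sketch of that override, assuming the usual conf/nutch-site.xml overlay file (the value 100000 is only an illustrative number larger than the seed count; a negative value is commonly used to remove the limit entirely):

<property>
  <name>db.max.outlinks.per.page</name>
  <value>100000</value>
</property>

To check whether a particular seed URL actually made it into the crawl (question 2 in the post), the crawldb can be inspected with Nutch's readdb tool; a sketch assuming the crawl directory used in the question:

bin/nutch readdb crawl/crawldb -stats
bin/nutch readdb crawl/crawldb -url http://us.abc.com/product/10001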

Upvotes: 4
