Reputation: 11
Apache Nutch 1.2 does not index my entire website, only one subfolder. My index page links to most areas/subfolders of the site, for example stuff, students, research... But Nutch only crawls one specific folder, "students" in this case. It seems as if links into the other directories are not followed.
crawl-urlfilter.txt: +^http://www5.my-domain.de/
seed.txt in the URLs-folder: http://www5.my-domain.de/
Starting Nutch with (both Windows and Linux were tried): nutch crawl "D:\Programme\nutch-1.2\URLs" -dir "D:\Programme\nutch-1.2\crawl" -depth 10 -topN 1000000
Different values for depth (5-23) and topN (100-1000000) were tested. Providing more links in seed.txt doesn't help at all; links found in the injected pages are still not followed.
Interestingly, crawling gnu.org works perfectly. My site uses no robots.txt and no meta tags that would block crawlers.
Any ideas?
Upvotes: 1
Views: 2088
Reputation: 101
While attempting to crawl all links from an index page, I discovered that Nutch was following exactly 100 of the roughly 1000 links on it. The setting that was holding me back was:
db.max.outlinks.per.page
Setting this to 2000 allowed Nutch to pick up all of the links in one pass.
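For reference, an override of this property might look like the sketch below. It assumes the usual Nutch 1.x layout, where site-specific settings go into conf/nutch-site.xml and take precedence over conf/nutch-default.xml. The value 2000 is the one I used; the description text is just my own note.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: site-specific overrides of nutch-default.xml -->
<configuration>
  <property>
    <name>db.max.outlinks.per.page</name>
    <!-- Raised so that an index page with ~1000 links is not cut off at 100 -->
    <value>2000</value>
    <description>Maximum number of outlinks processed per page.</description>
  </property>
</configuration>
```

As far as I know, a negative value removes the limit altogether, but check the description of the property in your own nutch-default.xml before relying on that.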
Upvotes: 2
Reputation: 11
Check whether you have a restriction on intra-domain (internal) links; the corresponding property should be set to false in nutch-site.xml. Also check the other properties, such as the maximum number of internal/external links per page and the HTTP content size limit. These sometimes produce wrong results during crawling.
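A sketch of what these checks could look like in conf/nutch-site.xml. The property names below are my best guess at the settings meant here, taken from the stock nutch-default.xml of Nutch 1.x, so verify them against the file shipped with your version. The values are only illustrative.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: assumed property names, verify in nutch-default.xml -->
<configuration>
  <property>
    <name>db.ignore.internal.links</name>
    <!-- false = links pointing to the same host are kept, not dropped -->
    <value>false</value>
  </property>
  <property>
    <name>db.ignore.external.links</name>
    <!-- false = links pointing to other hosts are kept as well -->
    <value>false</value>
  </property>
  <property>
    <name>http.content.limit</name>
    <!-- Size limit in bytes for downloaded content; a negative value
         disables truncation, so links near the end of a long page survive -->
    <value>-1</value>
  </property>
</configuration>
```

If the index page is larger than the content limit, it is truncated before parsing, and any links past the cut-off are never seen, which would look exactly like "only some subfolders are crawled".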
Ciao!
Upvotes: 1