user616146

Reputation: 11

Apache Nutch does not index the entire website, only subfolders

Apache Nutch 1.2 does not index the entire website, only subfolders. My index page provides links to most areas/subfolders of my website, for example stuff, students, research... But Nutch only crawls one specific folder, "students" in this case. It seems as if links in the other directories are not followed.

crawl-urlfilter.txt: +^http://www5.my-domain.de/
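One thing worth double-checking (an assumption, since the rest of the filter file is not shown): Nutch's regex URL filter applies rules top to bottom and the first match wins, and the stock crawl-urlfilter.txt contains skip rules whose placement relative to your accept rule matters. A sketch of how the relevant rules interact, using the host from the question:

```
# default rule: skip URLs containing characters that look like query
# strings -- this silently drops links such as page.php?id=3
-[?*!@=]

# accept everything under the target host
+^http://www5.my-domain.de/

# default catch-all: skip everything else.
# Any rule placed AFTER this line never matches.
-.
```

If the site's cross-directory links use query strings, or the `+` rule ended up below the `-.` catch-all, the observed behavior (only some links followed) would result.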

seed.txt in the URLs folder: http://www5.my-domain.de/

Starting Nutch with (both Windows and Linux were tried): nutch crawl "D:\Programme\nutch-1.2\URLs" -dir "D:\Programme\nutch-1.2\crawl" -depth 10 -topN 1000000

Different values for depth (5-23) and topN (100-1000000) have been tested. Providing more links in seed.txt doesn't help at all; links found in the injected pages are still not followed.

Interestingly, crawling gnu.org works perfectly. No robots.txt or blocking meta tags are used on my site.

Any ideas?

Upvotes: 1

Views: 2088

Answers (2)

user1357196

Reputation: 101

While attempting to crawl all links from an index page, I discovered that Nutch was limited to exactly 100 of the roughly 1000 links. The setting that was holding me back was:

db.max.outlinks.per.page

Setting this to 2000 allowed Nutch to index all of them in one shot.
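For reference, this property defaults to 100 in nutch-default.xml (which matches the exact cut-off observed above) and can be overridden in conf/nutch-site.xml. A sketch of the override:

```
<property>
  <name>db.max.outlinks.per.page</name>
  <value>2000</value>
  <description>Maximum number of outlinks processed per page;
  the default of 100 silently drops the rest. Use -1 for no
  limit.</description>
</property>
```

Restart the crawl after changing it; already-fetched pages keep the outlinks recorded at fetch time.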

Upvotes: 2

Luiscappa

Reputation: 11

Check whether you have an intra-domain link limitation enabled (the corresponding property should be false in nutch-site.xml). Also check other properties such as the maximum intra/extra links per page and the HTTP content size. Sometimes they produce wrong results during crawling.
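The properties alluded to here are presumably db.ignore.external.links and http.content.limit (the per-page outlink cap, db.max.outlinks.per.page, is covered in the other answer). A sketch of a nutch-site.xml override, assuming Nutch 1.x defaults:

```
<property>
  <name>db.ignore.external.links</name>
  <!-- must be false to follow links to other hosts -->
  <value>false</value>
</property>
<property>
  <name>http.content.limit</name>
  <!-- default is 65536 bytes; a larger index page is truncated
       and outlinks past the cut-off are lost. -1 disables the
       limit. -->
  <value>-1</value>
</property>
```

A truncated index page in particular would explain Nutch discovering links only to the first subfolder listed.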

Ciao!

Upvotes: 1
