Nutch failed to crawl particular site

Question

I am using Nutch 1.4 to crawl websites. For demo purposes, I started by crawling jabong.com, but I observed that Nutch could not fetch all the links on the site.

After visiting http://www.jabong.com/women/clothing/womens-suits-sets/ it does not fetch the links on this page that are attached to images.

I have configured Nutch as follows:

conf/nutch-default.xml ---> added the agent name
conf/regex-urlfilter.txt ---> instead of +. , I wrote +^http://([a-z0-9]*.)*jabong.com/
seed.txt contains http://www.jabong.com/

Can someone tell me what the problem could be? Why is it not fetching all the links?

Lina Clark · Accepted Answer

Finally, I was able to solve this problem after breaking my head over it for a long time, so I am sharing it here :) You have to adjust the parameters defined in nutch-default.xml in the conf directory.

So check the max.content.length setting (the property is named http.content.limit in nutch-default.xml). Its default value is around 64K (65536 bytes), but the actual page content was much larger. Nutch truncated the page at that limit, so the links in the part that was cut off never showed up in the crawled page.
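As a minimal sketch of the change, assuming the limit in question is the http.content.limit property that ships in Nutch 1.x's nutch-default.xml: the usual convention is to leave nutch-default.xml alone and put the override in conf/nutch-site.xml. The value 1048576 (1 MB) below is only an illustrative choice, not something taken from my actual setup. Pick whatever comfortably exceeds the size of the pages you crawl.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: properties here override nutch-default.xml -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- Illustrative value (1 MB); the shipped default is 65536 bytes. -->
    <!-- A negative value such as -1 disables truncation altogether. -->
    <value>1048576</value>
  </property>
</configuration>
```

After changing the limit, the affected pages have to be fetched and parsed again before the missing outlinks appear.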

So before crawling any site, do check these parameters :) Enjoy crawling :)

PS: I am sorry in case someone feels that I posted a question here and then posted the solution myself. Before posting the question I actually tried a lot.
