Reputation: 1100
Is it possible to crawl/fetch only plain HTML pages via Nutch (i.e. no pictures, video, flash, excel, exe, pdf or word files)?
How can I check the Content-Type of a page and fetch only text/html pages via Nutch?
Upvotes: 1
Views: 593
Reputation: 1100
Edit conf/regex-urlfilter.txt and add a rule listing the file suffixes to ignore:
-\.(jpg|gif|zip|ico)$
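To see which URLs a rule like this would drop, here is a rough Python sketch that imitates how the rules file is read: a leading `-` excludes, a leading `+` includes, and the first rule whose pattern is found in the URL decides. This is not Nutch's own code. The function name `accepts`, the example URLs, and the trailing `+.` catch-all rule are assumptions added for illustration.

```python
import re

# Rules written the way they would appear in conf/regex-urlfilter.txt.
# The first rule is the one from the answer; the "+." catch-all is an
# assumption added so that everything else is accepted.
RULES = [
    r"-\.(jpg|gif|zip|ico)$",
    r"+.",
]


def accepts(url, rules=RULES):
    """Hypothetical helper: return True if the URL would pass the rules."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        # The first rule whose pattern is found anywhere in the URL decides.
        if re.search(pattern, url):
            return sign == "+"
    # No rule matched: treat the URL as ignored.
    return False


if __name__ == "__main__":
    for url in (
        "http://example.com/index.html",
        "http://example.com/logo.jpg",
        "http://example.com/report.pdf",
    ):
        print(url, "->", "fetch" if accepts(url) else "skip")
```

Note that the rule as written only covers jpg, gif, zip and ico, so a URL ending in .pdf still passes; other suffixes (pdf, doc, xls, exe, swf and so on) would have to be added to the alternation. The match is also on the URL text only, not on the Content-Type the server returns, so a binary file served from a URL without a listed suffix would not be caught by it.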
Upvotes: 1