GML-VS

Reputation: 1100

How to crawl only HTML in Nutch?

Is it possible to crawl/fetch only plain HTML pages with Nutch (i.e. no images, video, Flash, Excel, EXE, PDF or Word files)?

How can I check the Content-Type of a page and fetch only text/html pages with Nutch?

Upvotes: 1

Views: 593

Answers (1)

GML-VS

Reputation: 1100

Edit conf/regex-urlfilter.txt:

Add a rule that rejects URLs ending in the file suffixes you want to ignore:

-\.(jpg|gif|zip|ico)$
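To see what this rule does and does not catch, here is a rough Python sketch of how such a rule list is evaluated. It is not Nutch code. It assumes the usual regex-urlfilter semantics (rules are tried top to bottom, the first match decides, a URL that matches nothing is dropped) and it assumes the file ends with an accept-all "+." rule. The accepts function and the example.com URLs are made up for illustration. Note that the filter only looks at the URL string, so it does not check the Content-Type of the response.

```python
import re

# Rules in file order. The "-" rule is the one from this answer.
# The trailing "+." accept-all rule is an assumption about the rest of the file.
RULES = [
    ("-", re.compile(r"\.(jpg|gif|zip|ico)$")),
    ("+", re.compile(r".")),
]


def accepts(url: str) -> bool:
    """Hypothetical helper: True if the first matching rule is a '+' rule."""
    for sign, pattern in RULES:
        if pattern.search(url):  # unanchored search, like a regex "find"
            return sign == "+"
    return False  # no rule matched, so the URL is dropped


if __name__ == "__main__":
    samples = [
        "http://example.com/index.html",
        "http://example.com/logo.jpg",
        "http://example.com/logo.JPG",
        "http://example.com/photo.jpg?size=large",
        "http://example.com/report.pdf",
    ]
    for url in samples:
        print("keep" if accepts(url) else "skip", url)
```

Only logo.jpg is skipped here. The match is case-sensitive, the $ anchor misses URLs with a query string, and pdf is not in the suffix list, so the list needs every extension you want to exclude.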

Upvotes: 1
