Farukh Khan

Reputation: 315

Apache Nutch: URLs in the regex-urlfilter.txt file

I am new to crawling, and especially to Apache Nutch. Configuring Apache Nutch is really complex. While researching it, I came across the regex-urlfilter.txt file, where you specify which pages you want to crawl and so limit the crawl. I could not find a good, simple tutorial on this, which is why I am asking here. The details of my question are given below.

Explanation

Suppose I have a website, https://www.example.com. To crawl only this website, I know I have to edit my regex-urlfilter.txt file like this: +^https://www.example.com/. Now what if I want to limit the crawl further? For example, I only want to crawl some of the pages on this website:

https://www.example.com/something/details/1
https://www.example.com/something/details/2
https://www.example.com/something/details/3
https://www.example.com/something/details/4
https://www.example.com/something/details/5
.
.
.
https://www.example.com/something/details/10

P.S.: As a new member, I may have made a lot of mistakes in asking this question. Please help me improve the question instead of giving it a -1. I will be really thankful to you all.

Upvotes: 1

Views: 740

Answers (1)

Yossi

Reputation: 159

If you only want to crawl https://www.example.com/something/details/ and below, change the last rule of regex-urlfilter.txt from:

# accept anything else
+.

To:

+https://www.example.com/something/details/
-.

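Nutch applies the rules from top to bottom, and the first matching pattern decides. The + rule therefore accepts every URL that contains https://www.example.com/something/details/, and the -. rule rejects all other URLs.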
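
To see how these two rules behave, here is a minimal Python sketch. It is not Nutch's own code: parse_rules and accepts are hypothetical helpers, and the sketch assumes the behaviour described in the comments of the default regex-urlfilter.txt (rules are tried top to bottom, the first pattern found anywhere in the URL decides, and a URL matching no rule is ignored). It models only the two lines above, not the other rules that sit earlier in the default file. The second rule set, with ^ and escaped dots, is an assumed stricter variant that is not part of the answer.

```python
import re

def parse_rules(text):
    """Parse '+regex' / '-regex' lines; skip blank lines and # comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """First rule whose pattern is found in the URL decides (assumed semantics)."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # no rule matched: the URL is ignored

# The two rules exactly as given in the answer.
ANSWER_RULES = parse_rules("""
+https://www.example.com/something/details/
-.
""")

# Assumed stricter variant: anchored at the start, dots escaped.
ANCHORED_RULES = parse_rules(r"""
+^https://www\.example\.com/something/details/
-.
""")

if __name__ == "__main__":
    for url in (
        "https://www.example.com/something/details/3",
        "https://www.example.com/about",
        "https://mirror.example.org/https://www.example.com/something/details/3",
    ):
        print(url, accepts(ANSWER_RULES, url), accepts(ANCHORED_RULES, url))
```

The last URL shows the difference: the unanchored rule accepts it because the pattern occurs somewhere inside the URL, while the anchored variant rejects it. If only pages on www.example.com should be crawled, anchoring the pattern with ^ may be the safer choice.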

Upvotes: 2
