Reputation: 315
I am new to crawling, and especially to Apache Nutch. The configuration for Apache Nutch is really complex. I have done a lot of research on Apache Nutch and came across the regex-urlfilter.txt file, where you specify which pages you want to crawl in order to limit your crawl. Since there is no good/simple tutorial about this, I am asking here. The explanation of the question is given below.
Explanation
Suppose I have a website named https://www.example.com. Now, in order to crawl only this website and limit my crawl, I know I have to edit my regex-urlfilter.txt file like this:
+^https://www.example.com/
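For context, a complete minimal regex-urlfilter.txt restricting the crawl to that host could look roughly like this (a sketch on my part; the stock file also ships with skip rules for file extensions, queries, etc. that you would normally keep above these lines):
# accept anything on www.example.com (dots escaped, since these lines are regular expressions)
+^https://www\.example\.com/
# reject everything else
-.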
Now what if I want to limit this even more? For example, I only want to crawl some of the pages from this website, such as the ones listed below (a regex sketch for this pattern follows the list).
https://www.example.com/something/details/1
https://www.example.com/something/details/2
https://www.example.com/something/details/3
https://www.example.com/something/details/4
https://www.example.com/something/details/5
.
.
.
https://www.example.com/something/details/10
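Just to make the goal concrete: these URLs all share the prefix https://www.example.com/something/details/ followed by a number, so (my own sketch, assuming the standard urlfilter-regex plugin, which uses Java regular expressions) a rule matching exactly these pages could look like:
# accept only numbered detail pages
+^https://www\.example\.com/something/details/[0-9]+$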
P.S.: As a new member, I may have made a lot of mistakes in asking a good question. Please help me to improve the question instead of giving -1. I will be really thankful to you all.
Upvotes: 1
Views: 740
Reputation: 159
If you only want to crawl https://www.example.com/something/details/ and below, replace the last line of regex-urlfilter.txt from:
# accept anything else
+.
to:
+https://www.example.com/something/details/
-.
That will include only URLs that contain https://www.example.com/something/details/ and ignore all other URLs.
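As a side note, the lines in regex-urlfilter.txt are regular expressions evaluated top to bottom, with the first matching rule deciding whether a URL is accepted. If you want the match to be stricter, you could also anchor the pattern and escape the dots (a variant sketch, not required, since the unanchored pattern above works as well):
# accept only the details section of www.example.com
+^https://www\.example\.com/something/details/
# reject everything else
-.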
Upvotes: 2