Reputation: 1091
I made a simple crawler using simplecrawler
:D
Its constructor has a Set object which holds visited URLs:
this.visited = new Set();
Any invalid URL will be added there:
this.visited.add(url);
Currently, when a new URL is added to the queue, I check whether it has already been visited:
if (this.visited.has(newURL))
Can I store a regex in this Set to block URLs from a specific site, used as below?
// to block www.xxx.com/123, www.xxx.com/456, www.xxx.com/789
this.visited.add('/www\.xxx\.com\/\d/g');
if (this.visited.has(givenURL))
// do not visit
else
// visit
If this can be done, what would be the best way to get this done?
Upvotes: 0
Views: 459
Reputation: 1066
You could loop over the Set and check if a URL matches the item in the set:
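this.visited = new Set();
var BreakException = {};
// Declared outside the callback so it is still in scope after the loop
var visited = false;
this.visited.add('www\\.xxx\\.com/\\d+');
this.visited.add('www.xxx.com/123');
try {
  this.visited.forEach(function(x) {
    // Treat each entry as a regex pattern and test the URL against it
    if ('www.xxx.com/123'.match(new RegExp(x))) {
      visited = true;
      throw BreakException;
    }
  });
} catch (e) {
  // Rethrow anything that is not our loop-breaking sentinel
  if (e !== BreakException) throw e;
}
if (visited) {
  // do not visit
} else {
  // visit
}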
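Pay attention to the pattern I added to the set. The one you used in the question wouldn't work: written as a quoted string it is not a regex literal, so the surrounding slashes and the g flag become part of the pattern, and the single backslashes are consumed by the string escapes.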
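To see the difference between the two patterns, here is a small sketch that feeds both strings to new RegExp and compares them with an equivalent regex literal. The variable names are only for illustration.

```javascript
// The quoted pattern from the question: the string escapes drop the
// backslashes, and the slashes and trailing "g" become literal characters.
const fromQuestion = '/www\.xxx\.com\/\d/g';

// The pattern from this answer: backslashes are doubled so they survive
// the string literal and reach the RegExp constructor.
const fromAnswer = 'www\\.xxx\\.com/\\d+';

const url = 'www.xxx.com/123';

const questionMatches = new RegExp(fromQuestion).test(url);
const answerMatches = new RegExp(fromAnswer).test(url);

// A regex literal (no quotes) is the other way to write the same pattern.
const literalMatches = /www\.xxx\.com\/\d+/.test(url);

console.log(fromQuestion); // the actual characters stored in the string
console.log(questionMatches, answerMatches, literalMatches);
```

Printing fromQuestion shows what new RegExp actually receives, which is usually the quickest way to debug a pattern stored as a string.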
You have to throw an exception to break out of the loop, since the Set's forEach doesn't support break.
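If you would rather avoid the exception, one possible alternative is to keep exact URLs in the Set and the blocking patterns in a separate array of RegExp objects, then use Array.prototype.some, which stops at the first match. This is only a sketch: blockedPatterns and shouldSkip are hypothetical names, not part of simplecrawler.

```javascript
// Exact URLs that were already visited (or rejected as invalid).
const visited = new Set(['www.xxx.com/123']);

// Hypothetical list of patterns for sites that should never be crawled.
// No "g" flag: with it, test() keeps state in lastIndex between calls.
const blockedPatterns = [/www\.xxx\.com\/\d+/];

// Hypothetical helper: true if the URL should not be visited.
function shouldSkip(url) {
  // Fast path: exact match against the Set.
  if (visited.has(url)) return true;
  // some() returns as soon as one pattern matches, so no manual break is needed.
  return blockedPatterns.some((pattern) => pattern.test(url));
}

console.log(shouldSkip('www.xxx.com/456'));
console.log(shouldSkip('www.yyy.com/456'));
```

Keeping the two collections separate means the Set lookup stays a plain has() call, and only the (usually short) pattern list is scanned.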
Upvotes: 1