Reputation: 1759
I'm implementing a crawler for a website with a growing number of entities. There is no information available about how many entities exist, and there is no list of all entities. Every entity can be accessed with a URL like this: http://www.somewebsite.com/entity_{i}, where {i} is the number of the entity, starting at 1 and incrementing by 1.
To crawl every entity I'm running a loop which checks whether an HTTP request returns a 200 or a 404. If I get a 404 NOT FOUND, the loop stops and I know I have all entities.
The serial way looks like this:
def atTheEnd = false
def i = 1
while (!atTheEnd) {
    atTheEnd = !crawleWebsite("http://www.somewebsite.com/entity_" + i)
    i++
}
crawleWebsite() returns true if it succeeds and false if it got a 404 NOT FOUND error.
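For illustration, a stripped-down version of crawleWebsite() could look roughly like this (using plain java.net.HttpURLConnection; the actual parsing and persistence of the entity is omitted):

// Simplified sketch - the real method also parses and stores the entity.
boolean crawleWebsite(String url) {
    def connection = new URL(url).openConnection() as HttpURLConnection
    connection.requestMethod = 'GET'
    connection.connectTimeout = 5000
    connection.readTimeout = 5000
    try {
        if (connection.responseCode == 404) {
            return false               // entity does not exist - we are past the end
        }
        // ... process connection.inputStream.text here ...
        return true
    } finally {
        connection.disconnect()
    }
}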
The problem is that crawling those entities can take very long, which is why I want to do it in multiple threads. But since I don't know the total number of entities, the tasks are not independent of each other.
What's the best way to solve this problem?
My approach would be this: use binary search with HTTP HEAD requests to find the total number of entities (somewhere between 500 and 1000) and then split that range across several threads.
Is there maybe a better way of doing this?
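A rough sketch of that binary-search idea, where entityExists() is a hypothetical helper that sends a HEAD request and treats anything but a 404 as "exists":

boolean entityExists(int i) {
    def connection = new URL("http://www.somewebsite.com/entity_" + i).openConnection() as HttpURLConnection
    connection.requestMethod = 'HEAD'
    connection.responseCode != 404
}

// Grow an upper bound exponentially, then binary-search for the last existing entity.
int findLastEntity() {
    int hi = 1
    while (entityExists(hi)) {
        hi *= 2
    }
    int lo = hi.intdiv(2)            // last index known to exist (0 if even entity_1 is missing)
    while (lo + 1 < hi) {
        int mid = (lo + hi).intdiv(2)
        if (entityExists(mid)) {
            lo = mid
        } else {
            hi = mid
        }
    }
    lo                               // == total number of entities
}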
tl;dr
Basically I want to tell a thread pool to programmatically create new tasks until a condition is satisfied (when the first 404 occurs) and to wait until every task has finished.
Note: I'm implementing this code using Grails 3.
Upvotes: 1
Views: 151
Reputation: 1987
As you said, the total number of entities is not known and can go into the thousands. In this case I would simply go for a fixed thread pool and speculatively query URLs even though you may already have reached the end. Consider this example.
@Grab(group = 'org.codehaus.gpars', module = 'gpars', version = '1.2.1')
import groovyx.gpars.GParsPool

// crawling simulation - ignore :-)
def crawleWebsite(url) {
    println "$url:${Thread.currentThread().name}"
    Thread.sleep(1)
    Math.random() * 1000 < 950   // ~95% of the simulated entities "exist"
}

final Integer step = 50
Boolean atTheEnd = false
Integer i = 1

while (true) {
    GParsPool.withPool(step) {
        // crawl the next batch of `step` URLs in parallel
        (i..<(i + step)).eachParallel {
            atTheEnd = atTheEnd || !crawleWebsite("http://www.somewebsite.com/entity_" + it)
        }
    }
    if (atTheEnd) {
        break
    }
    i += step
}
The thread pool is set to 50, and once all 50 URLs are crawled we check whether we have reached the end. If not, we carry on.
Obviously, in the worst-case scenario you can crawl 50 404s. But I'm sure you could get away with it :-)
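If that over-crawl ever becomes a concern, one possible refinement (just a sketch, reusing the crawleWebsite() and GParsPool setup from above) is to record the smallest index that came back as a 404, e.g. in an AtomicInteger, so you also know the exact number of entities once the loop stops:

import java.util.concurrent.atomic.AtomicInteger
import groovyx.gpars.GParsPool

AtomicInteger firstMissing = new AtomicInteger(Integer.MAX_VALUE)
final Integer step = 50
Integer i = 1

while (firstMissing.get() == Integer.MAX_VALUE) {
    GParsPool.withPool(step) {
        (i..<(i + step)).eachParallel { idx ->
            if (!crawleWebsite("http://www.somewebsite.com/entity_" + idx)) {
                // remember the lowest index that returned 404
                firstMissing.updateAndGet { current -> Math.min(current, idx) }
            }
        }
    }
    i += step
}
// entities are numbered contiguously from 1, so the count is the first missing index minus 1
println "Number of entities: ${firstMissing.get() - 1}"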
Upvotes: 1