user2950150


Is this web crawler doing a breadth-first search or a depth-first search?

Does anyone know if the web crawler described here uses a depth-first or breadth-first search? My instinct says that it's a breadth-first search, but I'm not 100% sure of this.

Also, is it common for web crawlers to use these kinds of searches (in particular, recursion)?

Upvotes: 3

Views: 2717

Answers (1)

templatetypedef


This is a depth-first search. Note this code:

//get all links and recursively call the processPage method
Elements questions = doc.select("a[href]");
for(Element link: questions){
    if(link.attr("href").contains("mit.edu"))
        processPage(link.attr("abs:href"));
}

This code will recursively explore all the links found on the page by fully exploring the first link and everything reachable, then the second link and everything reachable, etc. Consequently, this explores in a depth-first fashion.

That said, this is going to be really slow because only one thread is doing the exploration. This would probably be a lot more efficient if it were rewritten as a modified BFS that put unexplored pages into a worklist and had a bunch of threads that grabbed unexplored pages and processed them.
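As a sketch of that worklist idea: the version below replaces the recursion with a FIFO queue and a visited set, which is exactly what makes the traversal breadth-first. To keep it self-contained and runnable, a hypothetical `WEB` map stands in for fetching a page and extracting its links with Jsoup; in a real crawler you would swap that for the `doc.select("a[href]")` step, use a thread-safe queue (e.g. `ConcurrentLinkedQueue`), and have multiple worker threads pulling from it.

```java
import java.util.*;

public class BfsCrawlSketch {
    // Hypothetical link graph standing in for real page fetches.
    static final Map<String, List<String>> WEB = Map.of(
        "a", List.of("b", "c"),
        "b", List.of("d"),
        "c", List.of("d", "e"),
        "d", List.of(),
        "e", List.of("a"));

    static List<String> crawl(String start) {
        List<String> order = new ArrayList<>();
        Set<String> visited = new HashSet<>();      // never process a page twice
        Deque<String> worklist = new ArrayDeque<>(); // FIFO queue => breadth-first
        worklist.add(start);
        visited.add(start);
        while (!worklist.isEmpty()) {
            String page = worklist.poll();
            order.add(page);
            for (String link : WEB.getOrDefault(page, List.of())) {
                if (visited.add(link)) {   // add() is false if already seen
                    worklist.add(link);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(crawl("a")); // breadth-first: [a, b, c, d, e]
    }
}
```

Note that the visited set also keeps the crawler from looping forever when pages link back to each other, as `e` links back to `a` here.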

It's also not a good idea to use recursion when exploring web links: you'll easily overflow the call stack, since any sufficiently large website has chains of links deep enough to exhaust it. I figured this out from experience when trying to do a DFS on Wikipedia. :-)
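If you do want depth-first order without the call-stack risk, the usual fix is to manage an explicit stack yourself. The sketch below assumes the same hypothetical `WEB` map in place of real page fetches; only the data structure changes relative to the BFS version, a LIFO stack instead of a FIFO queue.

```java
import java.util.*;

public class IterativeDfsSketch {
    // Hypothetical link graph standing in for real page fetches.
    static final Map<String, List<String>> WEB = Map.of(
        "a", List.of("b", "c"),
        "b", List.of("d"),
        "c", List.of("e"),
        "d", List.of(),
        "e", List.of());

    static List<String> crawl(String start) {
        List<String> order = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Deque<String> stack = new ArrayDeque<>(); // LIFO stack => depth-first
        stack.push(start);
        while (!stack.isEmpty()) {
            String page = stack.pop();
            if (!visited.add(page)) continue;  // skip pages already processed
            order.add(page);
            // Push links in reverse so they pop in their original order.
            List<String> links = WEB.getOrDefault(page, List.of());
            for (int i = links.size() - 1; i >= 0; i--) {
                stack.push(links.get(i));
            }
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(crawl("a")); // depth-first: [a, b, d, c, e]
    }
}
```

The heap-allocated stack can grow far larger than the call stack, so deep link chains no longer crash the crawler.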

Hope this helps!

Upvotes: 6
