andy4thehuynh

Reputation: 2172

Ruby: recursion confusion

I am reading the code of someone who built a domain crawler method in Ruby. I am new to the concept of recursion and cannot wrap my head around how to read their code.

Code:

def crawl_domain(url, page_limit = 100)
  return if @already_visited.size == page_limit                # [1]
  url_object = open_url(url)
  return if url_object == nil                                  # [2]
  parsed_url = parse_url(url_object)
  return if parsed_url == nil                                  # [3]
  @already_visited[url] = true if @already_visited[url] == nil
  page_urls = find_urls_on_page(parsed_url, url)
  page_urls.each do |page_url|
    if urls_on_same_domain?(url, page_url) and @already_visited[page_url] == nil
      crawl_domain(page_url)
    end
  end
end

Questions:

  1. What does the combination of consecutive return statements mean?
  2. At line [1], if @already_visited's size is NOT the same as page_limit, does the program break out of the crawl_domain method and skip the rest of the code?
  3. If @already_visited's size IS the same as page_limit, does it then move on to the next return statement after it sets url_object = open_url(url)?

Thanks for any help in advance!

Source: http://www.skorks.com/2009/07/how-to-write-a-web-crawler-in-ruby/

Upvotes: 1

Views: 1226

Answers (3)

Kathryn

Reputation: 1607

Recursion simply means the method calls itself at least once in order to accomplish its task. Sometimes an algorithm can be expressed more cleanly recursively than iteratively. In the case of a domain crawler, you can imagine that once you know how to crawl the top-level page, you can use the same technique for each internal link:

def crawl_domain(url)
  page_urls = find_urls_on_page(url)
  page_urls.each do |page_url|
    if urls_on_same_domain?(url, page_url)

      # We can assume the method that works for the top level will work
      # for all the urls on the page too.
      crawl_domain(page_url)
    end
  end
end

That's the core of the method, but it has a major problem: it'll never end. (Well, it would end if the page it lands on doesn't have any internal links.) So that's why, to answer question #1, the method has several early return statements. If any of those returns fires, the method exits before it ever reaches the recursive call.

To answer question #2, the return will only break out of the method if @already_visited.size == page_limit. Notice that @already_visited keeps track of the urls the method has visited so it can avoid visiting them again. The size of that hash therefore tells you how many urls have been visited, so the first return kicks you out of the method once the number of visited urls reaches page_limit. (page_limit defaults to 100, but it's odd that the recursive call doesn't pass along the limit it was given; every nested call silently falls back to the default. I think that might be a bug.)
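If it is indeed a bug, the fix would be a one-liner inside the each block, something like:

      # pass the caller's limit down so the whole crawl shares one budget
      # instead of each nested call resetting to the default of 100
      crawl_domain(page_url, page_limit)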

For your third question, it seems you are misunderstanding conditional modifiers. In return if condition, the return only executes if the condition is true; otherwise the method simply continues to the next line. So the method only continues past the first guard if you haven't yet hit the visited-url limit.
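Here's a tiny standalone example of a conditional modifier (my own illustration, not from the crawler):

def greet(name)
  return if name.nil?  # the return runs only when name is nil
  puts "Hello, #{name}!"
end

greet(nil)    # prints nothing; the guard returned early
greet("Ada")  # prints "Hello, Ada!"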

I think a major point of confusion is that it's more common to see a single return at the end of a method; some people even advocate a single exit point. But there's a good reason to structure recursive methods with the self-call at the end: tail recursion (though note that Ruby's main interpreter doesn't optimize tail calls by default). So with recursive methods, you'll often see early returns to bail out. And since it can be hard to spot a return buried in the middle of a deeply nested conditional, they tend to use this basic format:

return if some_reason_to_bail == true
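Stacked up, several of these guards read like a checklist at the top of the method. A generic sketch (the method and helper names here are made up for illustration):

def process(item)
  return if item.nil?       # guard 1: nothing to work with
  return if item.empty?     # guard 2: nothing useful to do
  return if limit_reached?  # guard 3 (hypothetical helper): we've done enough
  # the real work only runs if every guard above let us through
end

That's exactly the shape of the three returns in crawl_domain.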

Upvotes: 1

matt

Reputation: 534966

Forget the web crawler for a moment. Consider how to use recursion to sum the numbers in an array:

arr = [1, 2, 3, 4]
def sum(array)
  # ??
end
puts sum(arr) #=> 10, please

Pretend that the problem is already solved. (All recursion depends on pretending this.) Then sum([1, 2, 3, 4]) is 1 + sum([2, 3, 4]), and in general sum for any array is the first element plus sum for the remainder of the array. In Ruby, the method that splits an array into its first element and its remainder is shift: calling shift returns the first element of the array and removes it from the array.
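If shift is new to you, here is its behavior in isolation:

arr = [1, 2, 3, 4]
arr.shift  #=> 1
arr        #=> [2, 3, 4]

So we may write: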

arr = [1, 2, 3, 4]
def sum(array)
  return array.shift + sum(array)
end
puts sum(arr)

Behold, a recursive solution! However, there's a problem: we recurse forever (in practice, until Ruby raises a SystemStackError because the call stack overflows). It is crucial in all recursions to put in a "stop" of some sort for the degenerate case before recursing. What is the degenerate case in our situation? It's when the array is empty! (Which it eventually will be, since every sum call removes an element from the array.) For an empty array, obviously the sum is zero. So:

arr = [1, 2, 3, 4]
def sum(array)
  return 0 unless array.length > 0
  return array.shift + sum(array)
end
puts sum(arr)
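To see why this terminates, expand the calls by hand:

sum([1, 2, 3, 4]) = 1 + sum([2, 3, 4])
                  = 1 + (2 + sum([3, 4]))
                  = 1 + (2 + (3 + sum([4])))
                  = 1 + (2 + (3 + (4 + sum([]))))
                  = 1 + (2 + (3 + (4 + 0)))   # the stop fires here
                  = 10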

The end.

Your web crawler is similar. We recurse to crawl each sublevel of the current page, but first we have some "stops" (the return statements) to prevent banging into an error situation in various degenerate cases. (In the case of the web crawler, there is no need to check whether the current page actually has sublevels, because if it doesn't, each will do nothing and we won't recurse.)
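A quick demonstration of that last point:

[].each { |x| puts x }  # the block never runs, so no recursive call is made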

Upvotes: 3

matt

Reputation: 534966

Confused by recursion? Read this book:

http://mitpress.mit.edu/sicp/

You'll be a better programmer and you'll recurse a blue streak forever after.

Upvotes: 2
