user1647372

Recursion in Python web crawler

I am attempting to make a little web crawler in Python. What is tripping me up right now is the recursive part and the depth of this problem. Given a URL and a maxDepth of how many sites from there I want to link to, I add the URL to the set of searched sites and download all the text and links from the site. For all the links that the URL contains, I want to crawl each link and get its words and links in turn. The problem is that when I go to recursively call the next URL, the depth is already at maxDepth and it stops after going only one more page deep. Hopefully I explained it decently; basically the question I am asking is: how do I make all the recursive calls and still track the depth correctly with self._depth += 1?

def crawl(self, url, maxDepth):
    self._listOfCrawled.add(url)
    text = crawler_util.textFromURL(url).split()
    for each in text:
        self._index[each] = url
    links = crawler_util.linksFromURL(url)
    if self._depth < maxDepth:
        self._depth = self._depth + 1
        for i in links:
            if i not in self._listOfCrawled:
                self.crawl(i, maxDepth)

Upvotes: 3

Views: 5529

Answers (1)

tobias_k

Reputation: 82899

The problem with your code is that you increase self._depth each time you call the function, and since it is an instance variable, it stays increased across the following calls. Let's say maxDepth is 3 and you have a URL A that links to pages B and C, B links to D, and C has a link to E. Your call hierarchy then looks like this (assuming that self._depth is 0 at the beginning):

crawl(self, A, 3)          # self._depth set to 1, following links to B and C
    crawl(self, B, 3)      # self._depth set to 2, following link to D
        crawl(self, D, 3)  # self._depth set to 3, no links to follow
    crawl(self, C, 3)      # self._depth >= maxDepth, skipping link to E

In other words, instead of tracking the depth of the current call, you are tracking the accumulated number of calls to crawl across the whole traversal.
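To see the effect in isolation, here is a minimal runnable sketch (the Demo class and the node_children map are hypothetical stand-ins for your crawler and its link structure) that reproduces the trace above:

    # node_children plays the role of linksFromURL; _depth is the
    # shared instance state that causes the bug.
    node_children = {"A": ["B", "C"], "B": ["D"], "C": ["E"]}

    class Demo:
        def __init__(self):
            self._depth = 0

        def walk(self, node, maxDepth):
            print(node, "visited at accumulated depth", self._depth)
            if self._depth < maxDepth:
                self._depth += 1  # persists across sibling calls
                for child in node_children.get(node, []):
                    self.walk(child, maxDepth)

    Demo().walk("A", 3)
    # Prints A 0, B 1, D 2, C 3 -- E is never visited, exactly as in
    # the hierarchy above.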

Instead, try something like this:

def crawl(self, url, depthToGo):
    # call this method with depthToGo set to maxDepth
    self._listOfCrawled.add(url)
    text = crawler_util.textFromURL(url).split()
    for each in text:
        # if the word is not in the index yet, create a new set for it
        if each not in self._index:
            self._index[each] = set()
        self._index[each].add(url)
    links = crawler_util.linksFromURL(url)
    # check if we can go deeper
    if depthToGo > 0:
        for i in links:
            if i not in self._listOfCrawled:
                # decrease depthToGo for the next level of recursion
                self.crawl(i, depthToGo - 1)
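As a side note, the dict-of-sets bookkeeping can be written a bit more compactly with collections.defaultdict; a small runnable sketch (the word list and URL are made up for illustration):

    from collections import defaultdict

    # defaultdict(set) creates an empty set for any missing key, so the
    # "if each not in self._index" check is no longer needed.
    index = defaultdict(set)
    url = "http://example.com"             # illustrative URL
    for each in "spam eggs spam".split():  # illustrative word list
        index[each].add(url)
    print(index["spam"])                   # {'http://example.com'}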

Upvotes: 3
