Reputation: 171
class Crawler1(object):
    def __init__(self):
        'constructor'
        self.visited = []
        self.will_visit = []

    def reset(self):
        'reset the visited links'
        self.visited = []
        self.will_visit = []

    def crawl(self, url, n):
        'crawl to depth n starting at url'
        self.analyze(url)
        if n < 0:
            self.reset()
        elif url in self.visted:
            self.crawl(self.will_visit[-1], n-1)
        else:
            self.visited.append(url)
            self.analyze(url)
            self.visited.append(url)
            self.will_visit.pop(-1)
            self.crawl(self.will_visit[-1], n-1)

    def analyze(self, url):
        'returns the list of URLs found in the page url'
        print("Visiting", url)
        content = urlopen(url).read().decode()
        collector = Collector(url)
        collector.feed(content)
        urls = collector.getLinks()
        for i in urls:
            if i in self.will_visit:
                pass
            else:
                self.will_visit.append(i)
I want this program to run through a series of links, but only go as far as "n" lets it.
I am not sure what is wrong with the code, though I'm sure it's plenty. Some hints would be nice.
Expected output if n = 1 and Site1 has links to Site2 and Site3:
Visiting [Site1]
Visiting [Site2]
Visiting [Site3]
Upvotes: 0
Views: 925
Reputation: 10799
You need to think carefully about how the crawler should behave, especially about how it decides to crawl to another page. That logic is concentrated in the crawl method:
1. If n < 0, then you have crawled deep enough and don't want to do anything. So simply return in that case.
2. Otherwise, analyze the page. Then, you want to crawl to each of the new URLs, with a depth of n-1.
Part of the confusion, I think, is that you're keeping a queue of URLs to visit, but also recursively crawling. For one thing, this means the queue contains not only the children of the last crawled URL, which you want to visit in order, but also children from other nodes that were crawled but have not yet been fully processed. It's hard to manage the shape of the depth-first search that way.
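For comparison, managing that frontier explicitly would mean carrying the remaining depth alongside each queued URL, so children from different levels don't get mixed together. A rough sketch of that idea (just to illustrate the point, not the approach I'd take here; analyze is assumed to return the page's links):

# Sketch only: an explicit-stack depth-first crawl where every entry
# carries its own depth budget, so the frontier keeps its shape.
def crawl_iterative(start_url, n, analyze):
    visited = []
    stack = [(start_url, n)]            # each URL is paired with its remaining depth
    while stack:
        url, depth = stack.pop()        # most recently discovered URL first (DFS)
        if depth < 0 or url in visited:
            continue
        visited.append(url)
        for child in analyze(url):      # analyze is assumed to return the page's links
            stack.append((child, depth - 1))
    return visited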
Instead, I would remove the will_visit variable, and have analyze return a list of the found links. Then process that list according to step 2 above, something like:
# Crawl this page and process its links
child_urls = self.analyze(url)
for u in child_urls:
    if u in self.visited:
        continue  # Do nothing, because it's already been visited
    self.crawl(u, n-1)
For this to work you also need to change analyze to simply return the list of URLs, rather than putting them onto the stack:
def analyze(self, url):
    ...
    urls = collector.getLinks()
    return urls
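Putting those two changes together, the whole thing might look roughly like the sketch below. This is not a drop-in fix: I've added a minimal stand-in for your Collector class (the real one isn't shown in the question), and I also record each page in visited before recursing so the visited check has something to compare against.

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class Collector(HTMLParser):
    'minimal stand-in for the Collector in the question: records the hrefs it sees'
    def __init__(self, url):
        super().__init__()
        self.url = url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(urljoin(self.url, value))

    def getLinks(self):
        return self.links

class Crawler1(object):
    def __init__(self):
        self.visited = []

    def crawl(self, url, n):
        'crawl to depth n starting at url'
        if n < 0:                          # step 1: deep enough, do nothing
            return
        self.visited.append(url)           # remember this page so it is never revisited
        child_urls = self.analyze(url)
        for u in child_urls:               # step 2: crawl each new link one level deeper
            if u in self.visited:
                continue
            self.crawl(u, n - 1)

    def analyze(self, url):
        'return the list of URLs found in the page at url'
        print("Visiting", url)
        content = urlopen(url).read().decode()
        collector = Collector(url)
        collector.feed(content)
        return collector.getLinks()

With n = 1 and a start page that links to two others, that should print the three "Visiting ..." lines from your expected output.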
Upvotes: 2