jreid423

Reputation: 71

Trying to create a Python web crawler

I am trying to implement a web crawler with python. Here is what I have so far:

import urllib2

seed=raw_input('Enter a url : ')

def getAllNewLinksOnPage(page,prevLinks):

    response = urllib2.urlopen(page)
    html = response.read()

    links,pos,allFound=[],0,False
    while not allFound:
        aTag=html.find("<a href=",pos)
        if aTag>-1:
            href=html.find('"',aTag+1)
            endHref=html.find('"',href+1)
            url=html[href+1:endHref]
            if url[:7]=="http://":
                if url[-1]=="/":
                    url=url[:-1]
                if not url in links and not url in prevLinks:
                    links.append(url)
                    print url
            closeTag=html.find("</a>",aTag)
            pos=closeTag+1
        else:
            allFound=True
    return links

toCrawl=[seed]
crawled=[]
while toCrawl:
    url=toCrawl.pop()
    crawled.append(url)
    newLinks=getAllNewLinksOnPage(url,crawled)
    toCrawl=list(set(toCrawl)|set(newLinks))

print crawled   

I was wondering how I would go about implementing a depth-first search and sorting the results.

Upvotes: 2

Views: 9779

Answers (1)

abarnert

Reputation: 365587

What you've implemented so far is a sort of random-order search, because you're keeping a set of links to crawl. (You're actually keeping a list, but converting it back and forth to a set repeatedly, which scrambles whatever order you had.)
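You can see the scrambling with a few made-up URLs (a small illustration, not your actual data):

    links = ['http://b.example', 'http://a.example', 'http://c.example']
    # Round-tripping a list through a set throws away the ordering;
    # the printed order is arbitrary, not the order of discovery.
    print list(set(links) | set(['http://d.example']))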

To turn this into a depth-first search, the usual solution is to do it recursively. Then you don't need any external storage of links to crawl. You do still need to keep track of links crawled so far—both to avoid repeats, and because you want to sort the links at the end (which requires having something to sort), but that's it. So:

def crawl(seed):
    crawled = set()
    def crawl_recursively(link):
        if link in crawled:
            return
        newLinks = getAllLinksOnPage(link)
        crawled.add(link)   # mark this page as crawled before recursing into its links
        for newLink in newLinks:
            crawl_recursively(newLink)
    crawl_recursively(seed)
    return sorted(crawled)
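Here getAllLinksOnPage is assumed to be your getAllNewLinksOnPage with the prevLinks parameter dropped, since crawl now handles de-duplication itself. A minimal sketch, adapted from the scraping code in your question:

    import urllib2

    def getAllLinksOnPage(page):
        # Same scraping logic as the question's getAllNewLinksOnPage,
        # minus the prevLinks check -- crawl() takes care of duplicates.
        response = urllib2.urlopen(page)
        html = response.read()
        links, pos = [], 0
        while True:
            aTag = html.find("<a href=", pos)
            if aTag == -1:
                break
            href = html.find('"', aTag + 1)
            endHref = html.find('"', href + 1)
            url = html[href + 1:endHref]
            if url[:7] == "http://":
                if url[-1] == "/":
                    url = url[:-1]
                if url not in links:
                    links.append(url)
            closeTag = html.find("</a>", aTag)
            # guard against a missing closing tag so the loop always advances
            pos = closeTag + 1 if closeTag != -1 else endHref + 1
        return links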

If you don't want to use recursion, the alternative is to use an explicit stack of links to crawl. But you can't keep reorganizing that stack, or it's not a stack anymore. Again, a separate set of already-crawled links takes care of avoiding repeats.

def crawl(seed):
    crawled = set()
    to_crawl = [seed]
    while to_crawl:
        link = to_crawl.pop()
        if link in crawled:
            continue
        crawled.add(link)
        newLinks = getAllLinksOnPage(link)
        to_crawl.extend(newLinks)
    return sorted(crawled)

Turning the stack into a queue (just change one line to to_crawl.pop(0)) makes this a breadth-first search.
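Note that list.pop(0) is O(n) per pop; if the frontier gets large, a collections.deque with popleft() does the same job efficiently. A sketch of that breadth-first variant (the function name is just illustrative):

    from collections import deque

    def crawl_breadth_first(seed):
        crawled = set()
        to_crawl = deque([seed])
        while to_crawl:
            link = to_crawl.popleft()   # FIFO: breadth-first order
            if link in crawled:
                continue
            crawled.add(link)
            to_crawl.extend(getAllLinksOnPage(link))
        return sorted(crawled)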

If you're worried about to_crawl getting too big because it's full of repeats, you can strip them out as you go. Since you wanted to go depth-first, you want to strip out the ones that are stacked up for later, not the new ones. The simplest way to do this is probably with an OrderedSet (e.g., the recipe linked from the collections docs):

    to_crawl = OrderedSet()
    # …
        new_link_set = OrderedSet(newLinks)
        to_crawl -= new_link_set
        to_crawl |= new_link_set - crawled
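Putting that together, and assuming an OrderedSet along the lines of that recipe (anything with add, pop, and the standard set operators will do), the de-duplicated depth-first loop might look like this:

    def crawl_deduped(seed):
        # OrderedSet: the MutableSet recipe linked from the collections docs
        # (not in the stdlib); its pop() removes the most recently added item,
        # which is exactly the LIFO behavior a depth-first stack needs.
        crawled = set()
        to_crawl = OrderedSet([seed])
        while to_crawl:
            link = to_crawl.pop()
            crawled.add(link)
            newLinks = getAllLinksOnPage(link)
            new_link_set = OrderedSet(newLinks)
            to_crawl -= new_link_set            # drop stale copies stacked up for later...
            to_crawl |= new_link_set - crawled  # ...and re-add the not-yet-crawled ones on top
        return sorted(crawled)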

Upvotes: 4
