I am trying to implement a web crawler in Python. Here is what I have so far:
import urllib2

seed = raw_input('Enter a url : ')

def getAllNewLinksOnPage(page, prevLinks):
    response = urllib2.urlopen(page)
    html = response.read()
    links, pos, allFound = [], 0, False
    while not allFound:
        aTag = html.find("<a href=", pos)
        if aTag > -1:
            href = html.find('"', aTag + 1)
            endHref = html.find('"', href + 1)
            url = html[href + 1:endHref]
            if url[:7] == "http://":
                if url[-1] == "/":
                    url = url[:-1]
                if not url in links and not url in prevLinks:
                    links.append(url)
                    print url
            closeTag = html.find("</a>", aTag)
            pos = closeTag + 1
        else:
            allFound = True
    return links

toCrawl = [seed]
crawled = []
while toCrawl:
    url = toCrawl.pop()
    crawled.append(url)
    newLinks = getAllNewLinksOnPage(url, crawled)
    toCrawl = list(set(toCrawl) | set(newLinks))

print crawled
I was wondering how I would go about implementing a depth-first search and sorting the results.
What you've implemented so far is a sort of random-order search, because you're keeping a set of links to crawl. (You're actually keeping a list, but converting it back and forth to a set repeatedly, which scrambles whatever order you had.)
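To see the scrambling in miniature (these URLs are made up; the exact order you get back is whatever the hash table happens to produce):

to_crawl = ['http://a.example', 'http://b.example', 'http://c.example']
newLinks = ['http://d.example', 'http://a.example']
# Round-tripping through set() keeps the right elements but throws away
# any notion of which link should be crawled next:
to_crawl = list(set(to_crawl) | set(newLinks))
# to_crawl now holds all four distinct URLs in an arbitrary, hash-driven order.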
To turn this into a depth-first search, the usual solution is to do it recursively. Then you don't need any external storage of links to crawl. You do still need to keep track of links crawled so far—both to avoid repeats, and because you want to sort the links at the end (which requires having something to sort), but that's it. So:
def crawl(seed):
    crawled = set()
    def crawl_recursively(link):
        if link in crawled:
            return
        # Mark this link as crawled *before* recursing, so cycles
        # (pages that link back to each other) can't loop forever.
        crawled.add(link)
        # getAllLinksOnPage is your scraping function minus the prevLinks
        # filtering; the crawled set handles duplicates now.
        newLinks = getAllLinksOnPage(link)
        for link in newLinks:
            crawl_recursively(link)
    crawl_recursively(seed)
    return sorted(crawled)
If you don't want to use recursion, the alternative is to use an explicit stack of links to crawl. But you can't keep reorganizing that stack, or it's not a stack anymore. Again, a separate set of already-crawled links will solve the problem of avoiding repeated lookups.
def crawl(seed):
    crawled = set()
    to_crawl = [seed]          # plain list used as a stack
    while to_crawl:
        link = to_crawl.pop()  # pop from the end: last in, first out = depth-first
        if link in crawled:
            continue
        crawled.add(link)
        newLinks = getAllLinksOnPage(link)
        to_crawl.extend(newLinks)
    return sorted(crawled)
Turning the stack into a queue (just change one line to to_crawl.pop(0)) makes this a breadth-first search.
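If it helps to see that spelled out, here is a minimal sketch of the breadth-first variant; the name crawl_breadth_first is just for illustration, and for large crawls a collections.deque with popleft() avoids the O(n) cost of pop(0) on a list:

def crawl_breadth_first(seed):
    crawled = set()
    to_crawl = [seed]           # list used as a queue: append at the back...
    while to_crawl:
        link = to_crawl.pop(0)  # ...pop from the front: first in, first out = breadth-first
        if link in crawled:
            continue
        crawled.add(link)
        newLinks = getAllLinksOnPage(link)
        to_crawl.extend(newLinks)
    return sorted(crawled)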
If you're worried about to_crawl getting too big because it's full of repeats, you can strip them out as you go. Since you wanted to go depth-first, you want to strip out the copies that are already stacked up for later, not the new ones. The simplest way to do this is probably with an OrderedSet (e.g., the recipe linked from the collections docs):
to_crawl = OrderedSet()
# …
new_link_set = OrderedSet(newLinks)
to_crawl -= new_link_set            # drop any copies already queued for later
to_crawl |= new_link_set - crawled  # re-queue them (minus anything already crawled) on top
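For context, here is roughly how those lines might slot into the stack-based loop above; crawl_dfs is just an illustrative name, and this assumes the recipe's OrderedSet, whose default pop() takes the most recently added element:

def crawl_dfs(seed):
    crawled = set()
    to_crawl = OrderedSet([seed])
    while to_crawl:
        link = to_crawl.pop()   # most recently queued link first = depth-first
        crawled.add(link)
        newLinks = getAllLinksOnPage(link)
        new_link_set = OrderedSet(newLinks)
        to_crawl -= new_link_set             # drop older queued copies
        to_crawl |= new_link_set - crawled   # push fresh, uncrawled links on top
    return sorted(crawled)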