Reputation: 499
What's the purpose of the second line? Don't we have to pass in a list as our seed
parameter anyway? I'd think you could just use seed everywhere we use the tocrawl
variable, instead of having a list inside a list.
def crawl_web(seed):
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            union(tocrawl, get_all_links(get_page(page)))
            crawled.append(page)
    return crawled
EDIT: full script - http://codepad.org/qxuzVkof
Upvotes: 2
Views: 1334
Reputation: 10119
tl;dr your seed value is not a list, tocrawl shouldn't be one (it should instead be a data structure which supports unions, such as a set)
This is your basic graph traversal algorithm. It's hard to explain the importance of the one variable without understanding its context, so allow me to over-indulge you. >:)
The web is what mathematicians like to call a graph. Each address on the web is a node in the graph and a link (which joins two addresses) is called an edge. So "crawling the web" is just finding paths along link-edges to different address-nodes in the web-graph.
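To make that concrete, one common way to represent such a graph in Python is a dictionary mapping each address-node to the addresses its link-edges reach (a purely illustrative toy, not part of the course code):

# node -> nodes reachable by following one link-edge
web_graph = {
    'http://a.example': ['http://b.example', 'http://c.example'],
    'http://b.example': ['http://c.example'],
    'http://c.example': [],  # a dead end: no outgoing links
}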
def crawl_web(seed):
At every node you're going to have to check for new nodes to visit which you can reach from the current one. But you have to start somewhere; here the code says to start at node seed by initializing the list of known, unvisited nodes to [seed]:
tocrawl = [seed]
You also want to keep track of visited nodes so you don't keep going around in circles:
crawled = []
Then you have to begin your traversal. Keep crawling until there are no more nodes to crawl:
while tocrawl:
Each iteration of the loop will visit another node, which we get from our list (initially just the seed node) by popping a single value off the end:
page = tocrawl.pop()
Don't visit previously crawled nodes; just continue:
if page not in crawled:
I don't think you can take unions of lists in Python in general; you'd probably have to make a set to do something like this:
# union(tocrawl, get_all_links(get_page(page)))
but the gist of it is that you're collecting all the edges from the current node and adding their nodes to the list (set?) of known, uncrawled addresses. You could define a list union function the following way, noting that it doesn't preserve order:
def list_union(a, b):
    # "|" is the set union operator
    return list(set(a) | set(b))
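A quick sanity check of that caveat (values are made up; the order of the result can differ between runs):

print(list_union([1, 2, 3], [3, 4]))  # contains 1, 2, 3, 4 -- in no guaranteed order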
Lastly, just remember to record that you've visited the current address, so you don't keep clicking around in link circles (cycles is the technical term, loops if a node has an edge to itself) forever:
crawled.append(page)
And when you're all done, return the full list of addresses reachable from your starting address by following links (the ones you've been to).
return crawled
Now let's see the whole thing in context:
def crawl_web(seed):
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            # list_union returns a new list, so assign the result back
            tocrawl = list_union(tocrawl, get_all_links(get_page(page)))
            crawled.append(page)
    return crawled
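To watch it run end to end, here's a minimal sketch with stubbed-out get_page and get_all_links over a made-up three-page web; FAKE_WEB and both stubs are purely illustrative (the real helpers fetch and parse actual HTML):

# Hypothetical toy web: each URL maps to the list of URLs it links to.
FAKE_WEB = {
    'http://a.example': ['http://b.example', 'http://c.example'],
    'http://b.example': ['http://a.example'],  # a cycle back to the seed
    'http://c.example': [],
}

def get_page(url):
    # Stand-in for fetching a page; our fake "content" is just the link list.
    return FAKE_WEB.get(url, [])

def get_all_links(page):
    # Stand-in for parsing links out of page content.
    return page

print(crawl_web('http://a.example'))
# e.g. ['http://a.example', 'http://c.example', 'http://b.example'] -- order may vary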
Upvotes: 2
Reputation: 287755
seed is not a list, but a single item, most likely the URL of the first page to crawl.
While one could change the design of the function to allow for multiple seed URLs, that would complicate the function's signature - almost every caller will just want to crawl from one URL. It is also bad style to have a function modify the caller's variables; if the function took a parameter tocrawl, you'd have to document (and get every caller to know) that:

- tocrawl represents all the URLs that serve as initial starting points
- tocrawl must be a list (or a compatible type); an ordinary iterable does not support pop. It must also conform to the properties that union expects. This is severely unpythonic.
- tocrawl will contain a snapshot of URLs yet to crawl.

Since the last two points essentially mean that the input parameter will get corrupted, a better design would be to make the argument an arbitrary iterable, and then construct a list, as sketched below. That way, the caller doesn't have to worry about all these silly conventions.
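A minimal sketch of that design, assuming the same get_page and get_all_links helpers as above (the parameter name seeds is mine, not from the original code): accept any iterable and copy it into a private list, so the caller's object is never mutated:

def crawl_web(seeds):
    tocrawl = list(seeds)  # private copy; works for lists, tuples, generators, ...
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            tocrawl.extend(get_all_links(get_page(page)))
            crawled.append(page)
    return crawled

# The caller can now pass any iterable of starting URLs:
crawl_web(['http://a.example', 'http://b.example'])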
One more thing: instead of writing your own union implementation and using a list to store all crawled websites, you should consider using a set in both instances, like this:
def crawl_web(seed):
    tocrawl = set([seed])
    crawled = set()
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            tocrawl.update(get_all_links(get_page(page)))
            crawled.add(page)
    return crawled
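Two details of this version worth knowing: set.pop() removes and returns an arbitrary element, so the crawl order is unspecified, and the membership test page not in crawled is constant-time on a set rather than linear as on a list. A quick illustration of the set behavior (URLs are made up):

tocrawl = set(['http://a.example'])
tocrawl.update(['http://b.example', 'http://a.example'])  # duplicates are silently ignored
page = tocrawl.pop()  # removes and returns an arbitrary element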
Upvotes: 6
Reputation: 11683
In this algorithm, seed is a page, not a list.
tocrawl is a stack where you collect linked pages you want to visit in order to find more links.
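Because list.pop() with no argument takes from the end, this stack makes the crawl visit the most recently discovered links first (roughly depth-first). If you wanted oldest-first (breadth-first) order instead, a collections.deque is the usual tool; a small sketch with made-up URLs:

from collections import deque

tocrawl = deque(['http://seed.example'])  # hypothetical starting URL
tocrawl.append('http://found.example')    # newly discovered links go on the back
page = tocrawl.popleft()                  # but we crawl the oldest one first (FIFO)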
Upvotes: 1