ZCJ

Reputation: 499

List inside a list

What's the purpose of the second line? Don't we have to pass in a list as our seed parameter anyway? I'd think you could just use seed in all the areas we have our tocrawl variable, instead of having a list inside a list.

def crawl_web(seed):
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            union(tocrawl, get_all_links(get_page(page)))
            crawled.append(page)
    return crawled

EDIT: full script - http://codepad.org/qxuzVkof

Upvotes: 2

Views: 1334

Answers (3)

machine yearning

Reputation: 10119

tl;dr: your seed value is not a list, and tocrawl shouldn't be one either (it should instead be a data structure that supports unions, such as a set)

This is your basic graph traversal algorithm. It's hard to explain the importance of the one variable without understanding its context, so allow me to over-indulge you. >:)

The web is what mathematicians like to call a graph. Each address on the web is a node in the graph and a link (which joins two addresses) is called an edge. So "crawling the web" is just finding paths along link-edges to different address-nodes in the web-graph.
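To make the graph idea concrete, here's a toy model (hypothetical page names, not part of the course code) where each address-node maps to the list of addresses its link-edges point to:

```python
# A tiny hypothetical web: each key is a page (a node), each value is
# the list of pages it links to (its outgoing edges).
toy_web = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],  # a cycle back to the start page
}

# "Crawling" just means following edges to discover reachable nodes.
print(toy_web["a.html"])  # the addresses linked from "a.html"
```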

def crawl_web(seed):

At every node you're going to have to check for new nodes to visit which you can reach from the current one. But you have to start somewhere; here the code says to start at node seed by initializing the list of known, unvisited nodes to seed:

    tocrawl = [seed]

You also want to keep track of visited nodes so you don't keep going around in circles:

    crawled = []

Then you have to begin your traversal. Keep crawling until there are no more nodes to crawl:

    while tocrawl:

Each iteration of the loop will visit another node, which we get from our list (initially just the seed node) by popping a single value off the end:

        page = tocrawl.pop()

Don't visit previously crawled nodes; just continue:

        if page not in crawled:

I don't think you can do unions on lists in Python in general; you'd probably have to use a set to do something like this:

            # union(tocrawl, get_all_links(get_page(page)))

but the gist of it is that you're collecting all the edges from the current node and adding their nodes to the list (set?) of known, uncrawled addresses. You could define a list union function the following way, noting that it doesn't preserve order:

def list_union(a, b):
    # "|" is the set union operator
    return list(set(a) | set(b))
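One thing to watch out for: list_union returns a new list rather than modifying a in place, so you have to assign the result back to tocrawl for the call to have any effect (a quick sketch with made-up page names):

```python
def list_union(a, b):
    # "|" is the set union operator; duplicates collapse, order is lost
    return list(set(a) | set(b))

tocrawl = ["x.html"]
links = ["y.html", "x.html"]
# Reassign the result -- calling list_union alone would change nothing.
tocrawl = list_union(tocrawl, links)
print(sorted(tocrawl))  # both pages, each appearing once
```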

Lastly, just remember that you've visited the current address, so you don't keep clicking around in link circles forever (cycles is the technical term; loops if a node has an edge to itself):

            crawled.append(page)

And when you're all done, return the full list of addresses reachable from your starting address by following links (the ones you've been to).

    return crawled

Now let's see the whole thing in context:

def crawl_web(seed):
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            tocrawl = list_union(tocrawl, get_all_links(get_page(page)))
            crawled.append(page)
    return crawled

Upvotes: 2

phihag

Reputation: 287755

seed is not a list, but a single item, most likely the URL of the first page to crawl.

While one could change the design of the function to allow for multiple seed URLs, that would complicate the function's signature; almost every caller will just want to crawl from one URL. It is also bad style to have a function modify the caller's variables: if the function took a parameter tocrawl, you'd have to document (and get every caller to know) that:

  • tocrawl represents all the URLs that serve as initial starting points
  • tocrawl must be a list (or a compatible type); an ordinary iterable does not support pop. It must also conform to the properties that union expects. This is severely unpythonic.
  • If the function exits normally, the argument list will be empty. This means almost every caller will want to pass a copy of the list.
  • If the function exits with an exception, tocrawl will contain a snapshot of URLs yet to crawl.

Since the last two points essentially mean that the input parameter will get corrupted, a better design would be to make the argument an arbitrary iterable, and then construct a list. That way, the caller doesn't have to worry about all these silly conventions.
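That iterable-based design might look like the sketch below. The get_page and get_all_links helpers are stubbed out with a fake in-memory web here so the example runs on its own; the real ones from the course script fetch and parse actual pages:

```python
# Stub helpers standing in for the course's real page-fetching functions.
FAKE_WEB = {"a": ["b", "c"], "b": ["c"], "c": []}

def get_page(url):
    return url  # the real version would download the page's contents

def get_all_links(page):
    return FAKE_WEB.get(page, [])

def crawl_web(seeds):
    # Accept any iterable of seed URLs and copy it into our own list,
    # so the caller's argument is never mutated or left half-consumed.
    tocrawl = list(seeds)
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            tocrawl.extend(get_all_links(get_page(page)))
            crawled.append(page)
    return crawled

print(sorted(crawl_web(["a"])))  # every page reachable from "a"
```

Because the function copies its argument first, passing a list, a tuple, or even a generator all work the same way, and the caller's data survives intact.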

One more thing: Instead of writing your own union implementation and using a list to store all crawled websites, you should consider using a set in both instances, like this:

def crawl_web(seed):
    tocrawl = set([seed])
    crawled = set()
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            tocrawl.update(get_all_links(get_page(page)))
            crawled.add(page)
    return crawled

Upvotes: 6

Carles Barrobés

Reputation: 11683

In this algorithm, seed is a page, not a list.

tocrawl is a stack where you collect linked pages you want to visit in order to find more links.
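Since tocrawl.pop() removes from the end of the list, the stack gives a depth-first traversal; popping from the front instead would behave like a queue and give breadth-first order. A quick sketch with made-up page names:

```python
stack = ["first", "second", "third"]
print(stack.pop())   # last in, first out -> depth-first order

queue = ["first", "second", "third"]
print(queue.pop(0))  # first in, first out -> breadth-first order
```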

Upvotes: 1
