blackmamba

Reputation: 1982

Python mechanize to iterate over all webpages of a website

I want to iterate over all the webpages of a website. I am trying to use mechanize here, but it only visits the links found on the main page. How should I modify it?

import mechanize

br = mechanize.Browser()
response = br.open("http://www.apple.com")

# Only iterates over the links found on the start page itself:
for link in br.links():
    print link.url
    br.follow_link(link)  # takes EITHER Link instance OR keyword args
    print br              # prints the Browser object, not the page
    br.back()

This is the new code:

import mechanize

visited_links = set()


def visit(br, url):
    br.open(url)
    for link in br.links():
        # Track pages by their absolute URL, and recurse with the URL
        # rather than the Link object itself:
        if link.absolute_url not in visited_links:
            visited_links.add(link.absolute_url)
            print link.absolute_url
            visit(br, link.absolute_url)


if __name__ == '__main__':
    br = mechanize.Browser()
    visit(br, "http://www.apple.com")

Upvotes: 1

Views: 1123

Answers (1)

Frerich Raabe

Reputation: 94319

Notice how what you want to do for each link is the same as what you did for your initial link: fetch the page and visit each link. You could solve this recursively, like this:

def visit(br, url):
    br.open(url)
    for link in br.links():
        print link.url
        visit(br, link.absolute_url)  # recurse with the link's URL

It'll get a bit more complicated in practice:

  1. You need to detect cycles, i.e. if a.html links to b.html and b.html links back to a.html, you don't want to play ping-pong and go back and forth forever. So you need some way to tell whether you have visited a page already, and since you might find a lot of pages, that test should be efficient. One straightforward way is to keep a global Python set of the links seen so far.

  2. You need to make up your mind about when two links are equal, e.g. should http://www.apple.com/index.html and http://www.apple.com/index.html#someAnchor be considered equal or not? You might need to come up with some sort of "normalization" of links (see the sketch after this list).

  3. Your program might take a long time, and it will almost certainly be "I/O bound", i.e. it will sit there waiting for some page to download. You could speed things up by visiting multiple pages in parallel - they would need to share the set of seen pages, though, so that two jobs don't visit the same page (a threaded sketch follows below).
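For point 2, a minimal normalization sketch, assuming you only want to drop the #fragment and ignore the case of the host name - the exact equality rules are a policy decision for your crawler, not something mechanize decides for you:

from urlparse import urldefrag, urlparse, urlunparse


def normalize(url):
    # Drop the #fragment and lower-case the host, so that
    # http://www.apple.com/index.html#someAnchor and
    # http://WWW.APPLE.COM/index.html map to the same key.
    url, _fragment = urldefrag(url)
    parts = urlparse(url)
    return urlunparse((parts.scheme, parts.netloc.lower(), parts.path,
                       parts.params, parts.query, ''))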
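And here is one way points 1 and 3 could fit together: a threaded sketch in which a shared seen set, guarded by a lock, both detects cycles and keeps two workers off the same page, while a Queue feeds URLs to a few parallel fetchers. Each worker gets its own Browser, since a mechanize.Browser instance is not thread-safe; the normalize() helper from the sketch above is assumed, and the thread count of 4 is an arbitrary choice:

import threading
from Queue import Queue

import mechanize

seen = set()            # normalized URLs already queued or visited
seen_lock = threading.Lock()
todo = Queue()          # URLs waiting to be fetched


def worker():
    br = mechanize.Browser()   # one Browser per thread
    while True:
        url = todo.get()
        try:
            br.open(url)
            for link in br.links():
                key = normalize(link.absolute_url)
                with seen_lock:
                    if key in seen:
                        continue   # cycle or duplicate: skip it
                    seen.add(key)
                print link.absolute_url
                todo.put(link.absolute_url)
        except Exception, e:
            # A dead link shouldn't kill the whole crawl.
            print "failed: %s (%s)" % (url, e)
        finally:
            todo.task_done()


if __name__ == '__main__':
    start = "http://www.apple.com"
    seen.add(normalize(start))
    todo.put(start)
    for _ in range(4):
        t = threading.Thread(target=worker)
        t.daemon = True
        t.start()
    todo.join()   # returns once every queued page has been processed

Note that this sketch still follows off-site links; a real crawler would also check that link.absolute_url stays on the host you started from.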

Upvotes: 1
