Reputation: 1982
I want to iterate over all the webpages of a website. I am trying to use mechanize here, but it only visits the links on the main page. How should I modify it?
import mechanize
import lxml.html

br = mechanize.Browser()
response = br.open("http://www.apple.com")

for link in br.links():
    print link.url
    br.follow_link(link)  # takes EITHER a Link instance OR keyword args
    print br
    br.back()
This is the new code:
import mechanize
import lxml.html

visited_links = set()

def visit(br, url):
    response = br.open(url)
    for link in br.links():
        if link.url not in visited_links:
            visited_links.add(link.url)
            print link.url
            visit(br, link.url)

if __name__ == '__main__':
    br = mechanize.Browser()
    visit(br, "http://www.apple.com")
Upvotes: 1
Views: 1123
Reputation: 94319
Notice how what you want to do for each link is the same as what you did for your initial link: fetch the page and visit each of its links. You could solve this recursively, like this:
def visit(br, url):
    response = br.open(url)
    links = br.links()
    for link in links:
        print link.url
        visit(br, link.url)
It'll get a bit more complicated in practice:
You need to detect cycles, i.e. if a.html links to b.html and b.html links back to a.html, you don't want to play ping-pong and go back and forth all the time. So you need some way to tell whether you have visited a page already, and since you might find a lot of pages, that test should be efficient. One straightforward way is to keep a global Python set with the seen links, as in the sketch below.
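For example (a minimal sketch; the name seen is illustrative, and very deep sites may hit Python's recursion limit):

seen = set()

def visit(br, url):
    seen.add(url)
    br.open(url)
    for link in br.links():
        if link.url not in seen:  # set membership is O(1) on average
            visit(br, link.url)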
You need to make up your mind about when two links are equal: should http://www.apple.com/index.html and http://www.apple.com/index.html#someAnchor be treated as the same page or not? You might need to come up with some sort of "normalization" of links.
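For instance, a small helper could drop the fragment and lowercase the scheme and host (a sketch using the standard urlparse module; which other normalizations make sense depends on the site):

import urlparse

def normalize(url):
    url, _fragment = urlparse.urldefrag(url)  # strip the #anchor part
    parts = urlparse.urlsplit(url)
    # scheme and host are case-insensitive; path and query are not
    return urlparse.urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                                parts.path, parts.query, ''))

With this, normalize("http://www.apple.com/index.html#someAnchor") and normalize("http://www.apple.com/index.html") compare equal, so the seen set treats them as one page.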
Your program might take a long time, and it will almost certainly be I/O bound, i.e. it will mostly sit there waiting for pages to download. You could speed things up by visiting multiple pages in parallel; the parallel jobs would need to share the set of seen pages, so that two jobs don't visit the same page.
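A minimal thread-based sketch (the worker function, thread count, and variable names are illustrative; each worker gets its own Browser, since a mechanize Browser is stateful and not safe to share across threads):

import threading
import Queue
import mechanize

seen = set()
seen_lock = threading.Lock()
todo = Queue.Queue()

def worker():
    br = mechanize.Browser()  # one Browser per thread
    while True:
        url = todo.get()
        try:
            br.open(url)
            for link in br.links():
                with seen_lock:  # guard the shared set
                    if link.url in seen:
                        continue
                    seen.add(link.url)
                todo.put(link.url)
        except Exception:
            pass  # skip pages that fail to load
        finally:
            todo.task_done()

todo.put("http://www.apple.com")
for _ in range(4):
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()
todo.join()  # wait until the queue drains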
Upvotes: 1