Reputation:
I would like to extract all links from a web page. Here is my code so far.
import mechanize
import lxml.html
from time import sleep

links = list()
visited_links = list()

br = mechanize.Browser()

def findLinks(url):
    response = br.open(url)
    visited_links.append(response.geturl())
    for link in br.links():
        response = br.follow_link(link)
        links.append(response.geturl())
        sleep(1)

findLinks("http://temelelektronik.net")

for link in links:
    if link in visited_links:
        links.remove(link)
    else:
        findLinks(link)
        print link

for link in visited_links:
    print link
for link in visited_links:
print link
In fact I don't want to write a web crawler. What I'd like to do is extract all the links from a web page and create a site map. I also wonder whether it is possible to get the last modification time of a file from the server using mechanize and Python.
While this code snippet works fine for HTML pages, it doesn't extract links from PHP pages, for example this page. How can I extract links from PHP pages?
Any help would be appreciated. Thanks.
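(On the last-modification question: servers report it in the Last-Modified response header; with mechanize, response.info() should expose the headers, much like urllib2. A minimal sketch of parsing such a header with the standard library, using a made-up header value:)

```python
from email.utils import parsedate_to_datetime

# A Last-Modified header as a server might return it (invented example value).
# With mechanize this would come from response.info().get("Last-Modified").
header = "Wed, 21 Oct 2015 07:28:00 GMT"

modified = parsedate_to_datetime(header)  # -> timezone-aware datetime
print(modified.isoformat())
```

Note the server is free to omit this header, so the code should handle a missing value.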
Upvotes: 0
Views: 1902
Reputation: 1
The syntax that is specific to Mechanize goes as follows (note that this is the Ruby Mechanize gem, not the Python mechanize module):
agent = Mechanize.new
page = agent.get(URL)
page.links returns an array of all links on the page.
page.links.first.text returns the text (without the href) of the first link.
page.link_with(:text => "Text").click returns the page that results from clicking that specific link.
Hope this helps.
Upvotes: -1
Reputation:
Here is another solution which uses a web spider to visit each link.
import os, sys; sys.path.insert(0, os.path.join("..", ".."))

from pattern.web import Spider, DEPTH, BREADTH, FIFO, LIFO

class SimpleSpider1(Spider):
    def visit(self, link, source=None):
        print "visiting:", link.url, "from:", link.referrer
    def fail(self, link):
        print "failed:", link.url

spider1 = SimpleSpider1(links=["http://www.temelelektronik.net/"], domains=["temelelektronik.net"], delay=0.0)

print "SPIDER 1 " + "-" * 50
while len(spider1.visited) < 5:
    spider1.crawl(cached=False)
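The visited-set bookkeeping the spider relies on can be sketched without any network access. The page graph below is invented for illustration; each URL is visited exactly once, breadth-first:

```python
from collections import deque

# Hypothetical in-memory "site": page URL -> list of outgoing links.
pages = {
    "/": ["/about", "/products"],
    "/about": ["/"],
    "/products": ["/products/1", "/products/2"],
    "/products/1": [],
    "/products/2": ["/"],
}

def crawl(start):
    """Breadth-first traversal; the visited set prevents revisiting pages."""
    visited, queue, order = set(), deque([start]), []
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for link in pages.get(url, []):
            if link not in visited:
                queue.append(link)
    return order

print(crawl("/"))  # ['/', '/about', '/products', '/products/1', '/products/2']
```

A real crawler would replace the dictionary lookup with an HTTP fetch plus link extraction, which is exactly what pattern's Spider does internally.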
Upvotes: 0
Reputation: 20679
I don't know mechanize, but I have used the pattern.web module, which has an easy-to-use HTML DOM parser. I think this is similar to what you are looking for to build a site map:
from pattern.web import URL, DOM
url = URL("http://temelelektronik.net")
dom = DOM(url.download())
for anchor in dom.by_tag('a'):
    print(anchor.href)
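The same extraction can be done with only the standard library. A sketch with html.parser (the sample markup is invented); it also shows why PHP pages are no different from HTML pages here — the server returns plain HTML either way, regardless of the URL's extension:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in the fed markup."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

html = '<p><a href="/index.php">home</a> <a href="/about.html">about</a></p>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['/index.php', '/about.html']
```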
Upvotes: 1