Andrey

Reputation: 859

Get all URLs on a Page Python

I'm working on something that requires me to get all the URLs on a page. It seems to work on most websites I've tested, for example microsoft.com, but it only returns three from google.com. Here is the relevant source code:


   import urllib
   import time
   import re
   fwcURL = "http://www.microsoft.com" #URL to read
   mylines = urllib.urlopen(fwcURL).readlines()
   print "Found URLs:"
   time.sleep(1) #Pause execution for a bit
   for item in mylines:
     if "http://" in item.lower(): #For http
       print item[item.index("http://"):].split("'")[0].split('"')[0] # Remove ' and " from the end, for example in href=
     if "https://" in item.lower(): #For https
       print item[item.index("https://"):].split("'")[0].split('"')[0] # Ditto

If my code can be improved, or if there is a better way to do this, please respond. Thanks in advance!

Upvotes: 1

Views: 4469

Answers (3)

Jon Clements

Reputation: 142156

I would use lxml and do:

import lxml.html

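page = lxml.html.parse('http://www.microsoft.com').getroot()
# './/a' searches the whole tree; a bare 'a' only matches direct children of <html>
anchors = page.findall('.//a')
urls = [a.get('href') for a in anchors if a.get('href')]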

It's worth noting that if links are dynamically generated (via JS or similar), then you won't get those short of automating a browser in some fashion.

Upvotes: 2

Froyo

Reputation: 18477

Try using Mechanize, BeautifulSoup, or lxml.

With BeautifulSoup, you can easily extract content from an HTML or XML document.

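# Python 2 with BeautifulSoup 3
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("some_url")
soup = BeautifulSoup(page.read())
# href=True skips anchors without an href, which would otherwise raise KeyError below
links = soup.findAll("a", href=True)
for link in links:
    print link["href"]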

BeautifulSoup is very easy to learn and understand.

Upvotes: 3

Antimony

Reputation: 39451

First off, HTML is not a regular language, and no amount of simple string manipulation like that is going to work on all pages. You need a real HTML parser. I'd recommend lxml. Then it's just a matter of recursing through the tree and finding the elements you want.
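As a rough sketch of the parser-based approach, here is a version that uses the standard library's html.parser instead of lxml, so it is event-based rather than a tree walk, but it visits every tag the same way. It is written for Python 3, and the names LinkCollector, extract_links, and SAMPLE, along with the example.com URLs, are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag, however deeply it is nested.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "/about" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html_text, base_url):
    """Return all anchor URLs found in html_text (hypothetical helper)."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    collector.close()
    return collector.links


# Hypothetical sample page; a real script would download the HTML first.
SAMPLE = """
<html><body>
  <div><p><a href="/about">About</a></p></div>
  <a href='https://example.org/x'>External</a>
  <a name="top">No href here</a>
</body></html>
"""

links = extract_links(SAMPLE, "http://example.com/index.html")
print(links)
```

Unlike the string-splitting approach in the question, this handles single or double quotes, several links on one line, and relative links, and it skips anchors that have no href.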

Second, some pages may be dynamic, so you won't find all of the contents in the HTML source. Google makes heavy use of JavaScript and AJAX (notice how it displays results without reloading the page).
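To illustrate the point about dynamic pages, here is a small sketch using only Python 3's standard library. The page source and the AnchorHrefs class are invented for the example. One link is written directly in the HTML, and the other is created by a script at load time in the browser.

```python
from html.parser import HTMLParser


class AnchorHrefs(HTMLParser):
    """Record the href of each <a> tag that is literally in the source."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


# Hypothetical page: the second link only exists after a browser runs the script.
PAGE_SOURCE = """
<html><body>
  <a href="/static">Present in the source</a>
  <script>
    var link = document.createElement("a");
    link.href = "/dynamic";
    document.body.appendChild(link);
  </script>
</body></html>
"""

parser = AnchorHrefs()
parser.feed(PAGE_SOURCE)
parser.close()
print(parser.hrefs)
```

Only /static is collected, because the parser never executes the script. Getting /dynamic as well would require something that actually runs the page's JavaScript, such as an automated browser.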

Upvotes: 2
