Reputation: 859
I'm working on something that requires me to get all the URLs on a page. It seems to work on most websites I've tested, for example microsoft.com, but it only returns three URLs from google.com. Here is the relevant source code:
import urllib
import time
import re

fwcURL = "http://www.microsoft.com"  # URL to read
mylines = urllib.urlopen(fwcURL).readlines()
print "Found URLs:"
time.sleep(1)  # Pause execution for a bit
for item in mylines:
    if "http://" in item.lower():  # For http
        print item[item.index("http://"):].split("'")[0].split('"')[0]  # Strip a trailing ' or ", e.g. from href="..."
    if "https://" in item.lower():  # For https
        print item[item.index("https://"):].split("'")[0].split('"')[0]  # Ditto
If my code can be improved, or if there is a better way to do this, please respond. Thanks in advance!
Upvotes: 1
Views: 4469
Reputation: 142156
I would use lxml and do:
import lxml.html
page = lxml.html.parse('http://www.microsoft.com').getroot()
anchors = page.findall('.//a')  # './/a' matches <a> elements anywhere in the tree, not just direct children of the root
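From there, a quick follow-up sketch to print each link (anchors without an href are simply skipped):

for a in anchors:
    href = a.get('href')  # returns None if the <a> has no href attribute
    if href:
        print href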
It's worth noting that if links are dynamically generated (via JS or similar), then you won't get those short of automating a browser in some fashion.
Upvotes: 2
Reputation: 18477
Try using mechanize, BeautifulSoup, or lxml.
With BeautifulSoup, you can get at all of the HTML/XML content very easily:
import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("some_url")
soup = BeautifulSoup(page.read())

links = soup.findAll("a")  # every <a> tag in the document
for link in links:
    print link["href"]
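Note that link["href"] raises a KeyError for an anchor that has no href attribute; if that matters, one small variant (a sketch) is to filter on the attribute up front:

for link in soup.findAll("a", href=True):  # only anchors that actually carry an href
    print link["href"]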
BeautifulSoup is very easy to learn and understand.
Upvotes: 3
Reputation: 39451
First off, HTML is not a regular language, and no amount of simple string manipulation like that is going to work on all pages. You need a real HTML parser; I'd recommend lxml. Then it's just a matter of recursing through the tree and finding the elements you want (a short sketch follows below).
Second, some pages may be dynamic, so you won't find all of their content in the HTML source. Google makes heavy use of JavaScript and AJAX (notice how it displays results without reloading the page).
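As a rough illustration of the lxml approach (the URL and the make_links_absolute call are just examples, not part of the original code):

import lxml.html

url = 'http://www.microsoft.com'
doc = lxml.html.parse(url).getroot()
doc.make_links_absolute(url)  # turn relative links into full URLs
for href in doc.xpath('//a/@href'):  # every href attribute on an <a> element
    print href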
Upvotes: 2