Reputation: 603
import urllib2
website = "WEBSITE"
openwebsite = urllib2.urlopen(website)
html = openwebsite.read()
print html
So far so good.
But I want to extract only the href links from the HTML. How can I solve this problem?
Upvotes: 60
Views: 250641
Reputation: 5373
Here's a lazy version of @stephen's answer:
import html.parser
import itertools
import urllib.request
class LinkParser(html.parser.HTMLParser):
    def reset(self):
        super().reset()
        self.links = iter([])

    def handle_starttag(self, tag, attrs):
        # Collect the href of every anchor tag as it is parsed.
        if tag == 'a':
            for (name, value) in attrs:
                if name == 'href':
                    self.links = itertools.chain(self.links, [value])

def gen_links(stream, parser):
    # Decode the response using the charset from the headers, defaulting to UTF-8.
    encoding = stream.headers.get_content_charset() or 'UTF-8'
    for line in stream:
        parser.feed(line.decode(encoding))
        yield from parser.links
Use it like so:
>>> parser = LinkParser()
>>> stream = urllib.request.urlopen('http://stackoverflow.com/questions/3075550')
>>> links = gen_links(stream, parser)
>>> next(links)
'//stackoverflow.com'
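Because gen_links is a generator, it only parses as much of the response as you consume. If you want every link at once you can simply exhaust it; a small sketch reusing LinkParser and gen_links from above:
parser = LinkParser()
stream = urllib.request.urlopen('http://stackoverflow.com/questions/3075550')
# Exhausting the generator collects every href found in the page.
all_links = list(gen_links(stream, parser))
print(len(all_links), 'links found')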
Upvotes: 7
Reputation: 675
Simplest way for me:
from urlextract import URLExtract
from requests import get
url = "sample.com/samplepage/"
req = requests.get(url)
text = req.text
# or if you already have the html source:
# text = "This is html for ex <a href='http://google.com/'>Google</a> <a href='http://yahoo.com/'>Yahoo</a>"
text = text.replace(' ', '').replace('=','')
extractor = URLExtract()
print(extractor.find_urls(text))
output:
['http://google.com/', 'http://yahoo.com/']
Upvotes: 0
Reputation: 74134
Try with BeautifulSoup:
from BeautifulSoup import BeautifulSoup
import urllib2
import re
html_page = urllib2.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    print link.get('href')
In case you just want links starting with http://, you should use:
soup.findAll('a', attrs={'href': re.compile("^http://")})
In Python 3 with BS4 it should be:
from bs4 import BeautifulSoup
import urllib.request
html_page = urllib.request.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page, "html.parser")
for link in soup.findAll('a'):
    print(link.get('href'))
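Relative hrefs (e.g. /some/page) can be turned into absolute URLs with urllib.parse.urljoin; a minimal sketch building on the Python 3 snippet above, where the base URL is simply the page you fetched:
from urllib.parse import urljoin
import urllib.request
from bs4 import BeautifulSoup

base = "http://www.yourwebsite.com"
html_page = urllib.request.urlopen(base)
soup = BeautifulSoup(html_page, "html.parser")
# Resolve each href against the page URL so relative links become absolute.
absolute_links = [urljoin(base, a.get('href')) for a in soup.findAll('a') if a.get('href')]
print(absolute_links)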
Upvotes: 132
Reputation: 3892
This answer is similar to others using requests and BeautifulSoup, but with a list comprehension. Because find_all() is the most popular method in the Beautiful Soup search API, you can use soup("a") as a shortcut for soup.findAll("a") and combine it with a list comprehension:
import requests
from bs4 import BeautifulSoup
URL = "http://www.yourwebsite.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, features='lxml')
# Find links
all_links = [link.get("href") for link in soup("a")]
# Only external links
ext_links = [link.get("href") for link in soup("a") if "http" in link.get("href")]
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all
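One caveat: if the page contains an anchor tag without an href attribute, link.get("href") returns None and the "http" membership test raises a TypeError. A defensive variant, sketched here and reusing the soup object from the snippet above:
# Skip anchors without an href before testing for "http".
ext_links = [link.get("href") for link in soup("a") if link.get("href") and "http" in link.get("href")]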
Upvotes: 3
Reputation: 1298
This answer is late, but it will work for users of recent Python versions:
from bs4 import BeautifulSoup
import requests
html_page = requests.get('http://www.example.com').text
soup = BeautifulSoup(html_page, "lxml")
for link in soup.findAll('a'):
    print(link.get('href'))
Don't forget to install the "requests", "BeautifulSoup", and "lxml" packages. Use .text on the response returned by get, otherwise it will throw an exception.
"lxml" is passed to suppress the warning about which parser is being used. You can also use "html.parser", whichever fits your case.
Upvotes: 6
Reputation: 860
Using requests with BeautifulSoup and Python 3:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.website.com')
bs = BeautifulSoup(page.content, features='lxml')
for link in bs.findAll('a'):
    print(link.get('href'))
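If you only want absolute links, BeautifulSoup's select() method also accepts CSS attribute selectors; a sketch along the same lines (the ^= prefix match assumes a reasonably recent bs4):
import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.website.com')
bs = BeautifulSoup(page.content, features='lxml')
# ^= matches href values that start with "http".
for link in bs.select('a[href^="http"]'):
    print(link.get('href'))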
Upvotes: 6
Reputation: 3517
Using BS4 for this specific task seems overkill.
Try instead:
import re
import urllib2

website = urllib2.urlopen('http://10.123.123.5/foo_images/Repo/')
html = website.read()
files = re.findall('href="(.*tgz|.*tar.gz)"', html)
print sorted(files)
I found this nifty piece of code on http://www.pythonforbeginners.com/code/regular-expression-re-findall and it works quite well for me.
I only tested it in my scenario of extracting a list of files from a web folder that exposes the files/folders in it, and I got a sorted list of the files/folders under the URL.
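If you want every href value rather than only the archive files, the same re.findall approach can be adapted; a sketch reusing the html variable from the snippet above and assuming the attribute values are double-quoted:
# Capture everything between href=" and the next closing quote
# (single-quoted attributes would need a different pattern).
all_hrefs = re.findall(r'href="([^"]*)"', html)
print sorted(all_hrefs)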
Upvotes: 10
Reputation: 146164
Look at using the Beautiful Soup HTML parsing library.
http://www.crummy.com/software/BeautifulSoup/
You will do something like this:
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(html)
for link in soup.findAll("a"):
    print link.get("href")
Upvotes: 15
Reputation: 2753
My answer probably sucks compared to the real gurus out there, but using some simple math, string slicing, find and urllib, this little script will create a list containing link elements. I tested it against Google and my output seems right. Hope it helps!
import urllib
test = urllib.urlopen("http://www.google.com").read()
sane = 0
needlestack = []
while sane == 0:
    curpos = test.find("href")
    if curpos >= 0:
        testlen = len(test)
        test = test[curpos:testlen]
        curpos = test.find('"')
        testlen = len(test)
        test = test[curpos+1:testlen]
        curpos = test.find('"')
        needle = test[0:curpos]
        # startswith accepts a tuple; "http" or "www" would only ever test "http".
        if needle.startswith(("http", "www")):
            needlestack.append(needle)
    else:
        sane = 1
for item in needlestack:
    print item
Upvotes: 5
Reputation: 49234
You can use the HTMLParser module.
The code would probably look something like this:
from HTMLParser import HTMLParser
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Only parse the 'anchor' tag.
        if tag == "a":
            # Check the list of defined attributes.
            for name, value in attrs:
                # If href is defined, print it.
                if name == "href":
                    print name, "=", value
parser = MyHTMLParser()
parser.feed(your_html_string)
Note: The HTMLParser module has been renamed to html.parser in Python 3.0. The 2to3 tool will automatically adapt imports when converting your sources to 3.0.
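For reference, a minimal sketch of the same parser under Python 3 (only the import path and the print call change):
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Only parse the 'anchor' tag.
        if tag == "a":
            for name, value in attrs:
                # If href is defined, print it.
                if name == "href":
                    print(name, "=", value)

parser = MyHTMLParser()
parser.feed(your_html_string)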
Upvotes: 38