Reputation: 1995
Does Python have any way of downloading an entire HTML page and its contents (images, CSS) to a local folder given a URL, and of updating the local HTML file to reference that content locally?
Upvotes: 64
Views: 133830
Reputation: 7632
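The function savePage below saves the .html page together with the downloaded javascripts, css and images, based on the tags script, link and img (the tags_inner dict keys). Resource files go into a folder with the suffix _files, and any exceptions are printed to sys.stderr.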
Uses Python 3+, Requests, BeautifulSoup and other standard libraries.
The function savePage receives a url and a pagepath where to save it.
import os, sys, re
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def savePage(url, pagepath='page'):
    def savenRename(soup, pagefolder, session, url, tag, inner):
        os.makedirs(pagefolder, exist_ok=True)  # create only once
        for res in soup.find_all(tag):  # images, css, etc.
            if res.has_attr(inner):  # the attribute holding the file reference must exist
                try:
                    filename, ext = os.path.splitext(os.path.basename(res[inner]))  # get name and extension
                    filename = re.sub(r'\W+', '', filename) + ext  # clean special chars from name
                    fileurl = urljoin(url, res.get(inner))
                    filepath = os.path.join(pagefolder, filename)
                    # rename html ref so can move html and folder of files anywhere
                    res[inner] = os.path.join(os.path.basename(pagefolder), filename)
                    if not os.path.isfile(filepath):  # was not downloaded
                        # fetch before opening, so a failed request leaves no empty file behind
                        filebin = session.get(fileurl)
                        with open(filepath, 'wb') as file:
                            file.write(filebin.content)
                except Exception as exc:
                    print(exc, file=sys.stderr)

    session = requests.Session()
    #... whatever other requests config you need here
    response = session.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    path, _ = os.path.splitext(pagepath)
    pagefolder = path+'_files'  # page contents folder
    tags_inner = {'img': 'src', 'link': 'href', 'script': 'src'}  # tag&inner tags to grab
    for tag, inner in tags_inner.items():  # saves resource files and rename refs
        savenRename(soup, pagefolder, session, url, tag, inner)
    with open(path+'.html', 'wb') as file:  # saves modified html doc
        file.write(soup.prettify('utf-8'))
Example saving google.com as google.html, with its contents in the google_files folder (in the current folder):
savePage('https://www.google.com', 'google')
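One thing to watch: savePage takes the file name from the raw attribute value, so a reference such as main.css?v=3 keeps its query string in the saved extension. A minimal sketch of one way around that, assuming it is acceptable to drop the query string and fragment; local_name and its default argument are hypothetical names, not part of the function above:

```python
import os
import re
from urllib.parse import urlparse, unquote

def local_name(resource_url, default="resource"):
    """Build a filesystem-safe file name from a resource URL (hypothetical helper)."""
    # Use only the path component, so '?v=3' or '#frag' never reach the name.
    path = unquote(urlparse(resource_url).path)
    stem, ext = os.path.splitext(os.path.basename(path))
    # Same cleanup as savePage; fall back to a default when nothing is left.
    stem = re.sub(r"\W+", "", stem) or default
    # Keep only word characters and the dot in the extension.
    ext = re.sub(r"[^\w.]", "", ext)
    return stem + ext

print(local_name("/static/css/main.style.css?v=3"))              # mainstyle.css
print(local_name("https://example.com/img/logo%20big.png#top"))  # logobig.png
```

Two different URLs can still map to the same name this way, so a real downloader would probably also want to detect collisions.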
Upvotes: 13
Reputation: 15295
What you're looking for is a mirroring tool. If you want one in Python, PyPI lists spider.py but I have no experience with it. Others might be better but I don't know - I use 'wget', which supports getting the CSS and the images. This probably does what you want (quoting from the manual)
Retrieve only one HTML page, but make sure that all the elements needed for the page to be displayed, such as inline images and external style sheets, are also downloaded. Also make sure the downloaded page references the downloaded links.
wget -p --convert-links http://www.server.com/dir/page.html
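If the download has to be started from a Python script, the same command can be run through subprocess. This is only a sketch, assuming wget is installed and on the PATH; wget_page_command and mirror_page are hypothetical helper names, and -P (wget's directory prefix option) is added here to choose the output folder:

```python
import shutil
import subprocess

def wget_page_command(url, dest_dir="."):
    """Build the argument list for the wget call quoted above (hypothetical helper)."""
    # -p: page requisites, --convert-links: rewrite links, -P: output directory
    return ["wget", "-p", "--convert-links", "-P", dest_dir, url]

def mirror_page(url, dest_dir="."):
    """Run wget if it is available and return its exit code."""
    if shutil.which("wget") is None:
        raise RuntimeError("wget not found on PATH")
    return subprocess.run(wget_page_command(url, dest_dir)).returncode

# Only prints the command; call mirror_page(...) to actually download.
print(wget_page_command("http://www.server.com/dir/page.html", "mirror"))
```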
Upvotes: 12
Reputation: 193696
You can use the urllib module to download individual URLs but this will just return the data. It will not parse the HTML and automatically download things like CSS files and images.
If you want to download the "whole" page you will need to parse the HTML and find the other things you need to download. You could use something like Beautiful Soup to parse the HTML you retrieve.
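As a rough illustration of the "parse the HTML and find the other things" step, the sketch below uses the standard library's html.parser instead of Beautiful Soup, so it runs without extra installs. ResourceCollector, the sample markup and the base URL are made up for the example; it only lists the resource URLs and does not download anything:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ResourceCollector(HTMLParser):
    """Collect absolute URLs of images, stylesheets and scripts (illustrative)."""
    # tag name -> attribute that holds the resource URL
    WANTED = {"img": "src", "link": "href", "script": "src"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attr = self.WANTED.get(tag)
        if attr is None:
            return
        value = dict(attrs).get(attr)
        if value:  # skip tags without the attribute, e.g. inline scripts
            self.resources.append(urljoin(self.base_url, value))

html = """<html><head>
<link rel="stylesheet" href="/css/site.css">
<script src="js/app.js"></script>
</head><body><img src="../img/logo.png"><img alt="no src"></body></html>"""

collector = ResourceCollector("http://www.server.com/dir/page.html")
collector.feed(html)
for resource in collector.resources:
    print(resource)
```

Each URL collected this way could then be fetched individually, for instance with urllib.request.urlopen.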
This question has some sample code doing exactly that.
Upvotes: 46
Reputation: 14129
You can use urllib:
import urllib.request
url = "http://stackoverflow.com/"
# FancyURLopener has been deprecated since Python 3.3; urlopen is the supported call
with urllib.request.urlopen(url) as f:
    content = f.read()
Upvotes: 16