Rahul Satal

Reputation: 387

How to download a full webpage with a Python script?

Currently I have a script that can only download the HTML of a given page.

Now I want to download all the files of the web page, including the HTML, CSS, JS and image files (the same as we get with Ctrl+S on any website).

My current code is:

import urllib
url = "https://en.wikipedia.org/wiki/Python_%28programming_language%29"
urllib.urlretrieve(url, "t3.html")
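
(Aside: on Python 3, urlretrieve lives in urllib.request, so a minimal equivalent of the snippet above would be the following, which still saves only the bare HTML.)

import urllib.request

url = "https://en.wikipedia.org/wiki/Python_%28programming_language%29"
urllib.request.urlretrieve(url, "t3.html")  # still only the single HTML file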

I have looked at many similar questions, but they all only download the HTML.

Upvotes: 27

Views: 78653

Answers (4)

imbr

Reputation: 7632

Using Python 3+ with Requests, BeautifulSoup and the standard library.

The function savePage receives a requests.Response and the pagefilename where to save it.

  • Saves the pagefilename.html in the current folder.
  • Downloads the JavaScript, CSS and image files referenced by the script, link and img tags and saves them in a folder pagefilename_files.
  • Any exceptions are printed to sys.stderr; returns a BeautifulSoup object.
  • The requests session must be a global variable unless someone writes cleaner code here for us (a variant that passes the session explicitly is sketched after the example below).

You can adapt it to your needs.


import os, sys
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def soupfindAllnSave(pagefolder, url, soup, tag2find='img', inner='src'):
    if not os.path.exists(pagefolder): # create only once
        os.mkdir(pagefolder)
    for res in soup.findAll(tag2find):   # images, css, etc.
        try:
            filename = os.path.basename(res[inner]) # res[inner] may not exist
            fileurl = urljoin(url, res.get(inner))
            filepath = os.path.join(pagefolder, filename)
            # rewrite the tag to point at the saved file path
            res[inner] = os.path.join(os.path.basename(pagefolder), filename)
            if not os.path.isfile(filepath): # not downloaded yet
                with open(filepath, 'wb') as file:
                    filebin = session.get(fileurl) # uses the global session
                    file.write(filebin.content)
        except Exception as exc:
            print(exc, file=sys.stderr)
    return soup

def savePage(response, pagefilename='page'):
    url = response.url
    soup = BeautifulSoup(response.text, 'html.parser')
    pagefolder = pagefilename + '_files' # folder for the page contents
    soup = soupfindAllnSave(pagefolder, url, soup, 'img', inner='src')
    soup = soupfindAllnSave(pagefolder, url, soup, 'link', inner='href')
    soup = soupfindAllnSave(pagefolder, url, soup, 'script', inner='src')
    with open(pagefilename + '.html', 'w', encoding='utf-8') as file:
        file.write(soup.prettify())
    return soup

Example: saving the Google page and its contents (google_files folder):

session = requests.Session()
#... whatever requests config you need here
response = session.get('https://www.google.com')
savePage(response, 'google')
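
If you prefer not to rely on a global session, one possible refactor (a sketch of mine, not part of the original answer; save_page and save_asset are made-up names) passes the session in explicitly:

import os, sys
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def save_asset(session, fileurl, filepath):
    # download a single asset with the session that was passed in
    if not os.path.isfile(filepath):
        with open(filepath, 'wb') as file:
            file.write(session.get(fileurl).content)

def save_page(session, response, pagefilename='page'):
    soup = BeautifulSoup(response.text, 'html.parser')
    pagefolder = pagefilename + '_files'
    os.makedirs(pagefolder, exist_ok=True)
    for tag, inner in (('img', 'src'), ('link', 'href'), ('script', 'src')):
        for res in soup.findAll(tag):
            try:
                filename = os.path.basename(res[inner])
                fileurl = urljoin(response.url, res.get(inner))
                filepath = os.path.join(pagefolder, filename)
                # rewrite the tag to point at the local copy
                res[inner] = os.path.join(os.path.basename(pagefolder), filename)
                save_asset(session, fileurl, filepath)
            except Exception as exc:
                print(exc, file=sys.stderr)
    with open(pagefilename + '.html', 'w', encoding='utf-8') as file:
        file.write(soup.prettify())
    return soup

session = requests.Session()
save_page(session, session.get('https://www.google.com'), 'google')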

Upvotes: 7

Sam Al-Ghammari

Reputation: 1021

The following implementation lets you collect the sub-HTML pages of the main website. It can be developed further to fetch the other files you need. I set the depth parameter so you can choose how many levels of sub-pages you want to crawl.

import urllib2
from BeautifulSoup import *
from urlparse import urljoin


def crawl(pages, depth=None):
    indexed_url = [] # a list for the main and sub-HTML websites in the main website
    for i in range(depth):
        for page in pages:
            if page not in indexed_url:
                indexed_url.append(page)
                try:
                    c = urllib2.urlopen(page)
                except:
                    print "Could not open %s" % page
                    continue
                soup = BeautifulSoup(c.read())
                links = soup('a') #finding all the sub_links
                for link in links:
                    if 'href' in dict(link.attrs):
                        url = urljoin(page, link['href'])
                        if url.find("'") != -1:
                            continue
                        url = url.split('#')[0]
                        if url[0:4] == 'http':
                            indexed_url.append(url)
        pages = indexed_url
    return indexed_url


pagelist=["https://en.wikipedia.org/wiki/Python_%28programming_language%29"]
urls = crawl(pagelist, depth=2)
print urls

Python 3 version, 2019. May this save some time for somebody:

#!/usr/bin/env python


import urllib.request as urllib2
from bs4 import *
from urllib.parse  import urljoin


def crawl(pages, depth=None):
    indexed_url = [] # a list for the main and sub-HTML websites in the main website
    for i in range(depth):
        for page in pages:
            if page not in indexed_url:
                indexed_url.append(page)
                try:
                    c = urllib2.urlopen(page)
                except:
                    print("Could not open %s" % page)
                    continue
                soup = BeautifulSoup(c.read(), 'html.parser')
                links = soup('a')  # finding all the sub_links
                for link in links:
                    if 'href' in dict(link.attrs):
                        url = urljoin(page, link['href'])
                        if url.find("'") != -1:
                            continue
                        url = url.split('#')[0]
                        if url[0:4] == 'http':
                            indexed_url.append(url)
        pages = indexed_url
    return indexed_url


pagelist=["https://en.wikipedia.org/wiki/Python_%28programming_language%29"]
urls = crawl(pagelist, depth=1)
print( urls )
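
The crawler above only collects links; to actually fetch the other files, one possible extension (a sketch of mine, not from the original answer; the helper name download_assets is made up) downloads the img, link and script resources of a crawled page with the same urllib machinery:

import os
import urllib.request
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup


def download_assets(page_url, folder='assets'):
    # fetch the page and save every img/link/script resource it references
    os.makedirs(folder, exist_ok=True)
    html = urllib.request.urlopen(page_url).read()
    soup = BeautifulSoup(html, 'html.parser')
    for tag, attr in (('img', 'src'), ('link', 'href'), ('script', 'src')):
        for node in soup.find_all(tag):
            ref = node.get(attr)
            if not ref:
                continue
            asset_url = urljoin(page_url, ref)
            name = os.path.basename(urlparse(asset_url).path) or 'index'
            try:
                urllib.request.urlretrieve(asset_url, os.path.join(folder, name))
            except Exception as exc:
                print("Could not fetch %s: %s" % (asset_url, exc))


download_assets("https://en.wikipedia.org/wiki/Python_%28programming_language%29")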

Upvotes: 23

rajatomar788

Reputation: 467

You can do that easily with the simple Python library pywebcopy.

For the current version, 5.0.1:


from pywebcopy import save_webpage

url = 'http://some-site.com/some-page.html'
download_folder = '/path/to/downloads/'    

kwargs = {'bypass_robots': True, 'project_name': 'recognisable-name'}

save_webpage(url, download_folder, **kwargs)

You will have the HTML, CSS and JS all in your download_folder, working just like the original site.
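
If you want to mirror an entire site rather than a single page, pywebcopy also provides a save_website function; a sketch, assuming the same keyword arguments apply (check the docs for your installed version):

from pywebcopy import save_website

kwargs = {'bypass_robots': True, 'project_name': 'recognisable-name'}

# crawls the whole site instead of a single page
save_website('http://some-site.com/', '/path/to/downloads/', **kwargs)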

Upvotes: 21

Try the Python library Scrapy. You can program Scrapy to recursively scan a website by downloading its pages, scanning them for links, and following those links:

An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.
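
As a rough illustration (a sketch of mine, not from the original answer; the spider name and depth limit are arbitrary), a minimal spider that saves each page it visits and follows the links on it could look like this:

import scrapy


class PageSpider(scrapy.Spider):
    name = "pages"
    start_urls = ["https://en.wikipedia.org/wiki/Python_%28programming_language%29"]
    custom_settings = {"DEPTH_LIMIT": 1}  # how many link levels to follow

    def parse(self, response):
        # save the raw HTML of this page to disk
        filename = response.url.rstrip("/").split("/")[-1] or "index"
        with open(filename + ".html", "wb") as f:
            f.write(response.body)
        # queue every link on the page for crawling
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

Run it with scrapy runspider spider.py (assuming the file is saved as spider.py).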

Upvotes: 2
