user2085779

How can I download an entire website using urllib?

I need to download an entire website using Python's urllib, like this:

import urllib

site = urllib.urlopen('http://www.mathrubumi.com/index.php')
site_data = site.read()

It downloads only the first page, index.php. How can I make the code download the entire website? By looping, or is there another way? For example, with wget no looping is required in the code:

wget \
     --recursive \
     --no-clobber \
     --page-requisites \
     --html-extension \
     --convert-links \
     --restrict-file-names=windows \
     --domains website.org \
     --no-parent \
     www.website.org/tutorials/html/

Upvotes: 6

Views: 14145

Answers (3)

Daniel Hepper

Reputation: 29977

If you want to download a complete website with urllib, you'll have to parse every page, find all links and download them too. It's doable, but it can be tricky to get right.

I suggest you either look into Scrapy if you want a pure Python solution, or just call wget from your script.
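
If you go the wget route, one possibility is to shell out to it with the standard subprocess module. A minimal sketch (the flags and URL below simply mirror the question and are placeholders):

import subprocess

# Mirror the wget invocation from the question; adjust flags and URL as needed
subprocess.check_call([
    'wget',
    '--recursive', '--no-clobber', '--page-requisites',
    '--html-extension', '--convert-links',
    '--restrict-file-names=windows',
    '--domains', 'website.org',
    '--no-parent',
    'http://www.website.org/tutorials/html/',
])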

Upvotes: 8

Torxed

Reputation: 23480

Since the user (in another question that has since been deleted) pointed to BeautifulSoup as an alternative, here's a working example to retrieve all <a href="something.html">something</a> links and save them locally:

import urllib2
from urlparse import urljoin
from os.path import basename
from BeautifulSoup import BeautifulSoup, SoupStrainer

def store_links(page):
    # Fetch the page and save it locally under its base file name
    site = urllib2.urlopen(page)
    site_data = site.read()

    with open(basename(page) or 'index.html', 'wb') as fh:
        fh.write(site_data)

    # Parse only the <a> tags and recurse into every linked page
    # (there is no visited-set or same-domain check here, so a real
    #  crawl would need one to ever terminate)
    for link in BeautifulSoup(site_data, parseOnlyThese=SoupStrainer('a')).findAll('a', href=True):
        store_links(urljoin(page, link['href']))

store_links('http://www.nytimes.com')

Note: I haven't tested this (I'm currently on a locked-down machine), so minor errors are possible, but the idea is the same:

  1. Create a recursive function that will call itself whenever it finds a link
  2. Give that recursive function a starting point and let it go nuts

Upvotes: 1

Godinall

Reputation: 2290

  1. If you are not using the urlencode method, you could use urllib2, which allows you to set your own headers and user agent (UA). Or you can use requests, which supports a richer API; see its documentation. A short sketch follows this list.
  2. To use urllib to download an entire website, the site would have to enable directory listing, which most site owners disallow via their .htaccess settings.
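
As a rough sketch of the first point (the URL and user-agent string here are just placeholders), setting your own headers with urllib2 or requests looks roughly like this:

import urllib2
import requests  # third-party: pip install requests

url = 'http://www.mathrubumi.com/index.php'
headers = {'User-Agent': 'Mozilla/5.0'}

# urllib2: attach custom headers, including a User-Agent, to the request
req = urllib2.Request(url, headers=headers)
page_data = urllib2.urlopen(req).read()

# requests: the same idea with a simpler API
page_data = requests.get(url, headers=headers).content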

Upvotes: 0
