I need to download an entire website using Python urllib, like:
import urllib
site = urllib.urlopen('http://www.mathrubumi.com/index.php')
site_data = site.read()
It downloads only the first page, index.php. How can I make the code download the entire website?
By looping?
Or is there any other way?
For example, with wget, looping is not required in the code:
wget \
    --recursive \
    --no-clobber \
    --page-requisites \
    --html-extension \
    --convert-links \
    --restrict-file-names=windows \
    --domains website.org \
    --no-parent \
    www.website.org/tutorials/html/
Upvotes: 6
Views: 14145
Reputation: 29977
If you want to download a complete website with urllib, you'll have to parse every page, find all links, and download them too. It's doable, but it can be tricky to get right.
I suggest you either look into scrapy if you want a pure Python solution, or just call wget from your script.
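For the second option, here's a minimal sketch of shelling out to wget from Python, assuming wget is installed and on the PATH and reusing the flags from the question:
import subprocess

# Mirror the site by delegating the recursion to wget itself
subprocess.check_call([
    'wget',
    '--recursive', '--no-clobber', '--page-requisites',
    '--html-extension', '--convert-links',
    '--restrict-file-names=windows',
    '--domains', 'website.org',
    '--no-parent',
    'www.website.org/tutorials/html/',
])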
Upvotes: 8
Reputation: 23480
Since the user (in another question that was asked but deleted because.. reasons..) pointed out BeautifulSoup as an alternative, here's a working example that retrieves all <a href="something.html">something</a> links and saves the pages locally:
import urllib2
from BeautifulSoup import BeautifulSoup, SoupStrainer
from os.path import basename

def store_links(page):
    # Save the page locally under its base name, then follow its links
    with open(basename(page), 'wb') as fh:
        site = urllib2.urlopen(page)   # urllib2, matching the import above
        site_data = site.read()
        fh.write(site_data)
    # Parse only the <a> tags and recurse into every linked page
    for link in BeautifulSoup(site_data, parseOnlyThese=SoupStrainer('a')):
        if link.has_key('href'):       # BeautifulSoup 3 tags use has_key()
            store_links(link['href'])

store_links('http://www.nytimes.com')
Notice: Haven't tested this (currently on a locked-down machine), so syntax errors might be expected, but the idea is the same.
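One caveat, as an aside: the snippet above recurses on the raw href values, which will fail for relative links. A minimal fix, using Python 2's urlparse module, would resolve each href against the current page before recursing:
from urlparse import urljoin

# Resolve relative hrefs (e.g. "/section/page.html") against the page URL
store_links(urljoin(page, link['href']))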
Upvotes: 1