Jingo
Jingo

Reputation: 778

How to retrieve a webpage in python, including any images

I'm trying to retrieve the source of a webpage, including any images. At the moment I have this:

import urllib

page = urllib.urlretrieve('http://127.0.0.1/myurl.php', 'urlgot.php')
print urlgot.php

which retrieves the source fine, but I also need to download any linked images.

I was thinking I could create a regular expression which searched for img src or similar in the downloaded source; however, I was wondering if there was urllib function that would retrieve the images as well? Similar to the wget command of:

wget -r --no-parent http://127.0.0.1/myurl.php

I don't want to use the os module and run the wget, as I want the script to run on all systems. For this reason I can't use any third party modules either.

Any help is much appreciated! Thanks

Upvotes: 8

Views: 3457

Answers (2)

Gringo Suave
Gringo Suave

Reputation: 31880

Don't use regex when there is a perfectly good parser built in to Python:

from urllib.request import urlretrieve  # Py2: from urllib
from html.parser import HTMLParser      # Py2: from HTMLParser

base_url = 'http://127.0.0.1/'

class ImgParser(HTMLParser):
    def __init__(self, *args, **kwargs):
        self.downloads = []
        HTMLParser.__init__(self, *args, **kwargs)

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for attr in attrs:
                if attr[0] == 'src':
                    self.downloads.append(attr[1])

parser = ImgParser()
with open('test.html') as f:
    # instead you could feed it the original url obj directly
    parser.feed(f.read())

for path in parser.downloads:
    url = base_url + path
    print(url)
    urlretrieve(url, path)

Upvotes: 7

Marcelo Cantos
Marcelo Cantos

Reputation: 185852

Use BeautifulSoup to parse the returned HTML and search for image links. You might also need to recursively fetch frames and iframes.

Upvotes: 3

Related Questions