Mridul Sachan

Reputation: 93

How to make image crawler which can download images with their respective URLs

I'm working on a project where I need a dataset of images available on the Internet together with their URLs. For this, I have to download a few thousand images, so I'm planning to download them from image hosting sites like https://www.pexels.com/, https://pixabay.com/, and a few other similar sites such as Flickr.

"""
dumpimages.py
    Downloads all the images on the supplied URL, and saves them to the
    specified output folder ("/test/" by default)

Usage:
    python dumpimages.py http://example.com/ [output]
"""
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen, urlretrieve
from urllib.parse import urljoin
import os
import sys

def main(url, out_folder="/test/"):
    """Downloads all the images at 'url' to out_folder"""
    soup = bs(urlopen(url), "html.parser")
    os.makedirs(out_folder, exist_ok=True)

    for image in soup.find_all("img"):
        src = image.get("src")
        if not src:
            continue  # skip <img> tags without a src attribute
        print("Image: %s" % src)
        # resolve relative image paths against the page URL
        img_url = urljoin(url, src)
        filename = img_url.split("/")[-1]
        outpath = os.path.join(out_folder, filename)
        urlretrieve(img_url, outpath)

def _usage():
    print("usage: python dumpimages.py http://example.com [outpath]")

if __name__ == "__main__":
    url = sys.argv[-1]
    out_folder = "/test/"
    if not url.lower().startswith("http"):
        out_folder = sys.argv[-1]
        url = sys.argv[-2]
        if not url.lower().startswith("http"):
            _usage()
            sys.exit(-1)
    main(url, out_folder)

For this, I have written the simple Python script shown above, which fetches all the images available on a web page when given that page's URL as input. But I want to make it work in such a way that, if I give the homepage, it can download all the images available on that site. If there is any other way to get the images together with their URL data, I would be very thankful for the help.
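One way to extend the single-page script above to a whole site is a breadth-first crawl: start from the homepage, collect the image URLs on each page, and follow every same-domain link that has not been visited yet. Below is a minimal sketch of that idea; the function name `crawl_site`, the injected `fetch` callable, and the `max_pages` limit are my own choices for illustration, not part of the original script.

```python
"""Breadth-first same-domain crawler sketch (illustrative, not a full solution)."""
from collections import deque
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def crawl_site(start_url, fetch, max_pages=100):
    """Crawl pages on start_url's domain, breadth-first.

    `fetch` is any callable that takes a URL and returns the page HTML
    (e.g. a wrapper around urlopen); it is passed in so the traversal
    logic can be tested without network access.
    Returns a dict mapping each visited page URL to a list of absolute
    image URLs found on that page.
    """
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    images = {}

    while queue and len(images) < max_pages:
        page_url = queue.popleft()
        try:
            html = fetch(page_url)
        except Exception:
            continue  # skip pages that fail to download
        soup = BeautifulSoup(html, "html.parser")

        # collect absolute image URLs on this page
        images[page_url] = [
            urljoin(page_url, img["src"])
            for img in soup.find_all("img")
            if img.get("src")
        ]

        # enqueue unvisited links on the same domain
        for a in soup.find_all("a", href=True):
            link = urljoin(page_url, a["href"]).split("#")[0]
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)

    return images
```

The returned dict is exactly the "images with their URL data" pairing you describe; you could then feed each image URL to `urlretrieve` as in your existing script. In practice you would also want to respect robots.txt and rate-limit your requests before crawling sites like Pexels or Pixabay.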

Upvotes: 1

Views: 7665

Answers (1)

Tuhin Bepari

Reputation: 745

Really happy to say that I did exactly the same thing in Python. Have a look at my repo on GitHub: https://github.com/digitaldreams/image-crawler-python

Upvotes: 1
