YasserKhalil

Reputation: 9538

Download bulk images in Python

After watching a video about how to download images using Python, I typed out the code from the video. Here it is:

import pandas as pd
import urllib.request

def url_to_jpg(i, url, file_path):
    filename = 'image-{}.jpg'.format(i)
    fullpath = '{}{}'.format(file_path, filename)
    print(fullpath)
    urllib.request.urlretrieve(url, fullpath)
    print('{} saved.'.format(filename))
    return None

FILENAME = 'Images URLs.csv'
FILE_PATH = 'Images/'
urls = pd.read_csv(FILENAME)

for i, url in enumerate(urls.values):
    url_to_jpg(i, url, FILE_PATH)

When testing the code, I encountered an error at this line: urllib.request.urlretrieve(url, fullpath). The traceback is:

Images/image-0.jpg
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-36-d92ed57d1d8e> in <module>
     15 
     16 for i, url in enumerate(urls.values):
---> 17     url_to_jpg(i, url, FILE_PATH)

<ipython-input-36-d92ed57d1d8e> in url_to_jpg(i, url, file_path)
      6     fullpath = '{}{}'.format(file_path, filename)
      7     print(fullpath)
----> 8     urllib.request.urlretrieve(url, fullpath)
      9     print('{} saved.'.format(filename))
     10     return None

C:\ProgramData\Anaconda3\lib\urllib\request.py in urlretrieve(url, filename, reporthook, data)
    243     data file as well as the resulting HTTPMessage object.
    244     """
--> 245     url_type, path = _splittype(url)
    246 
    247     with contextlib.closing(urlopen(url, data)) as fp:

C:\ProgramData\Anaconda3\lib\urllib\parse.py in _splittype(url)
   1006         _typeprog = re.compile('([^/:]+):(.*)', re.DOTALL)
   1007 
-> 1008     match = _typeprog.match(url)
   1009     if match:
   1010         scheme, data = match.groups()

TypeError: cannot use a string pattern on a bytes-like object

Any ideas about that error?

** I have found a partial solution, which is modifying this line: url_to_jpg(i, url[0], FILE_PATH)
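For context, a minimal check of why url[0] is needed, assuming the CSV holds a single column of links: DataFrame.values is a 2-D array, so each row the loop yields is itself a one-element array rather than a string.

import pandas as pd

urls = pd.read_csv('Images URLs.csv')

for i, row in enumerate(urls.values):
    print(type(row))     # <class 'numpy.ndarray'> - this is what urlretrieve was given before
    print(type(row[0]))  # <class 'str'> - the actual URL
    break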

But even with that change, it seems some of the links are not allowed, as I got another error: HTTPError: HTTP Error 403: Forbidden. How can I overcome this?

** I tried to add headers (a User-Agent) as suggested, but I don't know how to finish it properly. How do I use urlretrieve in that case? This is my attempt:

import urllib.request

hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

response = urllib.request.Request("http://www.gunnerkrigg.com//comics/00000001.jpg", headers=hdr)
print(urllib.request.urlopen(response))
urllib.request.urlretrieve(urllib.request.urlopen(response).read(),'oo.jpg')
#urllib.request.urlretrieve("http://www.gunnerkrigg.com//comics/00000001.jpg", "00000001.jpg")
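For reference, a minimal sketch of one way this attempt could be finished. The last urlretrieve call fails because it is given the downloaded bytes rather than a URL; assuming it is acceptable to skip urlretrieve entirely, the bytes returned by urlopen can be written straight to the file:

import urllib.request

hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}
req = urllib.request.Request("http://www.gunnerkrigg.com//comics/00000001.jpg", headers=hdr)

# urlopen returns a file-like response; .read() gives the raw image bytes,
# which can be written to disk directly instead of going through urlretrieve.
with urllib.request.urlopen(req) as response, open('oo.jpg', 'wb') as out_file:
    out_file.write(response.read())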

Upvotes: 0

Views: 1236

Answers (1)

Berkay

Reputation: 1068

This code will help you overcome the HTTPError: HTTP Error 403: Forbidden.

It's a header-added version of your code: build_opener / install_opener registers an opener with a browser-style User-Agent header, so every later urlretrieve call sends that header automatically.

import pandas as pd
import urllib.request

# build an opener
opener = urllib.request.build_opener()

# add a header for opener
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7')]

# install opener once
urllib.request.install_opener(opener)

def url_to_jpg(i, url, file_path):
    filename = 'image-{}.jpg'.format(i)
    fullpath = '{}{}'.format(file_path, filename)
    print(fullpath)
    urllib.request.urlretrieve(url, fullpath)
    print('{} saved.'.format(filename))
    return None

FILENAME = 'Images URLs.csv'
FILE_PATH = 'Images/'
urls = pd.read_csv(FILENAME)

for i, url in enumerate(urls.values):
    url_to_jpg(i, url[0], FILE_PATH)
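Even with the header installed, a few hosts may still refuse the request. A small sketch of wrapping the loop so one bad link does not stop the whole download, assuming it is fine to simply skip those URLs:

import urllib.error

for i, url in enumerate(urls.values):
    try:
        url_to_jpg(i, url[0], FILE_PATH)
    except urllib.error.HTTPError as e:
        # Some servers may still reject the request (e.g. 403); skip and move on.
        print('Skipping {}: {}'.format(url[0], e))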

Upvotes: 1
