Reputation: 2564
I am trying to download all the images on a particular Wikipedia page. Here is the code snippet:
from bs4 import BeautifulSoup as bs
import urllib2
import urlparse
from urllib import urlretrieve
site="http://en.wikipedia.org/wiki/Pune"
hdr= {'User-Agent': 'Mozilla/5.0'}
outpath=""
req = urllib2.Request(site,headers=hdr)
page = urllib2.urlopen(req)
soup =bs(page)
tag_image=soup.findAll("img")
for image in tag_image:
    print "Image: %(src)s" % image
    urlretrieve(image["src"], "/home/mayank/Desktop/test")
After running the program, I see an error with the following stack trace:
Image: //upload.wikimedia.org/wikipedia/commons/thumb/0/04/Pune_Montage.JPG/250px-Pune_Montage.JPG
Traceback (most recent call last):
  File "download_images.py", line 15, in <module>
    urlretrieve(image["src"], "/home/mayank/Desktop/test")
  File "/usr/lib/python2.7/urllib.py", line 93, in urlretrieve
    return _urlopener.retrieve(url, filename, reporthook, data)
  File "/usr/lib/python2.7/urllib.py", line 239, in retrieve
    fp = self.open(url, data)
  File "/usr/lib/python2.7/urllib.py", line 207, in open
    return getattr(self, name)(url)
  File "/usr/lib/python2.7/urllib.py", line 460, in open_file
    return self.open_ftp(url)
  File "/usr/lib/python2.7/urllib.py", line 543, in open_ftp
    ftpwrapper(user, passwd, host, port, dirs)
  File "/usr/lib/python2.7/urllib.py", line 864, in __init__
    self.init()
  File "/usr/lib/python2.7/urllib.py", line 870, in init
    self.ftp.connect(self.host, self.port, self.timeout)
  File "/usr/lib/python2.7/ftplib.py", line 132, in connect
    self.sock = socket.create_connection((self.host, self.port), self.timeout)
  File "/usr/lib/python2.7/socket.py", line 571, in create_connection
    raise err
IOError: [Errno ftp error] [Errno 111] Connection refused
Could you please help me understand what is causing this error?
Upvotes: 0
Views: 695
Reputation: 298106
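A leading // is shorthand for the current protocol (a protocol-relative URL). Wikipedia uses this shorthand in its image links, so you have to specify HTTP explicitly. Without a scheme, Python's urllib treats the URL as a file URL and, because it names a host, falls back to FTP, which is what the open_file and open_ftp frames in your traceback show: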
for image in tag_image:
    # image is a BeautifulSoup Tag, so read its src attribute before prefixing the scheme
    src = 'http:' + image['src']
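
As a slightly more general alternative to prefixing 'http:' by hand, the src value can be resolved against the page URL with urljoin, which also copes with site-relative and already-absolute links. This is only a minimal sketch: the helper name absolute_src and the constant PAGE_URL are made up for illustration, and the try/except import is there only so the snippet runs on both Python 2 (as in the question) and Python 3.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2, as used in the question

# Page being scraped (taken from the question).
PAGE_URL = "http://en.wikipedia.org/wiki/Pune"


def absolute_src(src, base=PAGE_URL):
    """Hypothetical helper: turn an <img> src into an absolute URL.

    A protocol-relative src ("//host/path") inherits the scheme of base,
    a site-relative src ("/path") inherits scheme and host, and an
    already-absolute src is returned unchanged.
    """
    return urljoin(base, src)


if __name__ == "__main__":
    src = ("//upload.wikimedia.org/wikipedia/commons/thumb/0/04/"
           "Pune_Montage.JPG/250px-Pune_Montage.JPG")
    print(absolute_src(src))
```

Inside the loop this would be used as urlretrieve(absolute_src(image["src"]), some_path), where some_path is whatever destination file you choose for that image.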
Upvotes: 1