Reputation: 113
I've scraped a website for images, which I then want to download. To download them I need the absolute URL of each image, but this is all I've managed to scrape:
2001.JPG
big.jpg
pics.gif
gchq.jpg
All of these images are stored in the variable images.
I'm looking for one function that can find all of the absolute URLs at once and store them in a variable.
This is the code I use to scrape the images:
images = re.findall(r'src=[\"|\']([^\"|\']+)[\"|\']',webpage.decode())
(I've had a look at various similar questions on here, but none seem to handle multiple images at once.)
If anyone could point me in the right direction, that would be great. Any suggestions for downloading the images would be welcome as well.
Upvotes: 0
Views: 1744
Reputation: 866
With BeautifulSoup and urllib you should be able to collect the images in a web page, resolve each src against the page URL with urljoin, and download them in a loop.
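from urllib import urlretrieve
import urlparse
import urllib2
from bs4 import BeautifulSoup

url = "<your_url>"
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(urllib2.urlopen(url), "html.parser")
# img[src] skips <img> tags that have no src attribute
for img in soup.select('img[src]'):
    # Resolve the (possibly relative) src against the page URL
    img_url = urlparse.urljoin(url, img['src'])
    file_name = img['src'].split('/')[-1]
    urlretrieve(img_url, file_name)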
Python 3 compatible code:
from bs4 import BeautifulSoup
from urllib.request import urlopen, urlretrieve
from urllib.parse import urljoin

url = "<url>"
soup = BeautifulSoup(urlopen(url), "html.parser")
# src=True skips <img> tags that have no src attribute
for img in soup.find_all('img', src=True):
    img_url = urljoin(url, img['src'])  # absolute URL of the image
    file_name = img['src'].split('/')[-1]  # last path segment as the local name
    urlretrieve(img_url, file_name)
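If you already have the relative names in a list, as in the question, and only want the absolute URLs collected in one variable, something like the following might do. It is only a sketch: the helper name to_absolute_urls and the page URL are made up for illustration, and it assumes you know the URL of the page the src values were scraped from.

```python
from urllib.parse import urljoin


def to_absolute_urls(page_url, sources):
    """Resolve each scraped src value against the URL of the page it came from."""
    return [urljoin(page_url, src) for src in sources]


# Hypothetical page URL (an assumption); the file names are the ones from the question.
page_url = "http://example.com/gallery/index.html"
images = ["2001.JPG", "big.jpg", "pics.gif", "gchq.jpg"]

absolute_urls = to_absolute_urls(page_url, images)
print(absolute_urls)
```

urljoin leaves src values that are already absolute unchanged and resolves root-relative ones (starting with /) against the site root, so the same call should cope with a mix of link styles.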
Upvotes: 3