J Doe

Reputation: 113

Converting relative paths to absolute paths for images scraped from websites in Python

I've scraped a website for images that I then want to download. To download them, however, I need the absolute path of each image, and this is all I've managed to scrape:

2001.JPG big.jpg pics.gif gchq.jpg

All of these images are stored in the variable images. I'm looking for a single function that can find all of the absolute paths at once and store them in a variable.

This is the code I use to scrape the images:

images = re.findall(r'src=["\']([^"\']+)["\']', webpage.decode())

(I've had a look at various similar questions on here, but none of them seem to handle multiple images at once.)

If anyone could point me in the right direction, that would be great. Any suggestions for downloading the images would be welcome as well.

Upvotes: 0

Views: 1744

Answers (1)

Vivek Harikrishnan

Reputation: 866

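With BeautifulSoup and urllib you should be able to collect the images in a web page, then iterate over them and download each one. urljoin resolves each relative src against the page URL, which gives you the absolute URL.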

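from urllib import urlretrieve
import urlparse
from bs4 import BeautifulSoup
import urllib2

url = "<your_url>"
# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(urllib2.urlopen(url), "html.parser")
# Only select <img> tags that actually carry a src attribute.
for img in soup.select('img[src]'):
    # Resolve the (possibly relative) src against the page URL.
    img_url = urlparse.urljoin(url, img['src'])
    # Use the last path segment as the local file name.
    file_name = img['src'].split('/')[-1]
    urlretrieve(img_url, file_name)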

Python 3 compatible code:

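from bs4 import BeautifulSoup
from urllib.request import urlopen, urlretrieve
from urllib.parse import urljoin

url = "<url>"
# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(urlopen(url), "html.parser")

# src=True skips <img> tags that have no src attribute.
for img in soup.find_all('img', src=True):
    # Resolve the (possibly relative) src against the page URL.
    img_url = urljoin(url, img['src'])
    # Use the last path segment as the local file name.
    file_name = img['src'].split('/')[-1]
    urlretrieve(img_url, file_name)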
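
If you already have the relative paths in a list (like the images variable in the question) and only want the absolute URLs, something along these lines should work. This is a minimal sketch: to_absolute is a helper name I made up, and the example.com page URL is a placeholder for whatever page you scraped.

```python
from urllib.parse import urljoin


def to_absolute(base_url, relative_paths):
    """Resolve every path against base_url and return the absolute URLs."""
    return [urljoin(base_url, path) for path in relative_paths]


# Placeholder page URL (an assumption); the file names come from the question.
page_url = "http://example.com/gallery/index.html"
images = ["2001.JPG", "big.jpg", "pics.gif", "gchq.jpg"]

absolute_urls = to_absolute(page_url, images)
print(absolute_urls)
```

urljoin leaves a src that is already absolute unchanged and resolves one with a leading slash against the site root, so the same call handles mixed input.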

Upvotes: 3
