Jonathan Devereux

Reputation: 175

Downloading all images from a page with BeautifulSoup not working

I am trying to download the show images from this page with BeautifulSoup.

When I run the code below, the only image that downloads is the spinning loading icon.

When I check the requests tab for the page, I can see requests for all the other images, so I assumed they would be downloaded as well. I am not sure why they are not, since they are contained within img tags in the page's HTML.

import re
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
site = 'https://www.tvnz.co.nz/categories/sci-fi-and-fantasy'
response = requests.get(site)
soup = BeautifulSoup(response.text, 'html.parser')
# src=True skips <img> tags without a src attribute (avoids a KeyError below)
image_tags = soup.find_all('img', src=True)
urls = [img['src'] for img in image_tags]
for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    if not filename:
        print("Regular expression didn't match with the url: {}".format(url))
        continue
    # Resolve relative URLs against the page URL
    url = urljoin(site, url)
    response = requests.get(url)
    # Open the file only once the image has been fetched
    with open(filename.group(1), 'wb') as f:
        f.write(response.content)
print("Download complete, downloaded images can be found in current directory!")

Upvotes: 0

Views: 294

Answers (1)

Driftr95

Reputation: 4710

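The images appear to be added to the page by JavaScript after the initial HTML loads, which would explain why requests only sees the loading icon. You can try going through the API they seem to be using to populate the page: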

import requests
api_url = 'https://apis-edge-prod.tech.tvnz.co.nz/api/v1/web/play/page/categories/sci-fi-and-fantasy'
r = requests.get(api_url)
try:
    embVals = r.json()['_embedded'].values()
except Exception as e:
    embVals = []
    print('failed to get embedded items\n', str(e))

# Collect the 'src' of every dictionary stored under a key containing "image"
urls = [
    v['src']
    for ev in embVals
    for k, v in ev.items()
    if 'image' in k.lower() and isinstance(v, dict) and 'src' in v
]

# for url in urls: ...  # the download loop from the question should work the same

(The images seem to be in nested dictionaries under keys like 'portraitTileImage', 'image', 'tileImage' and 'coverImage'. You can also loop over embVals with one or more for loops to extract other data, for example if you want to include more in the filename or metadata.)
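A for loop along those lines might look like the sketch below. The shape of each embedded item is only assumed from the keys listed above, the 'title' field is a guess, and collect_images and the sample data are hypothetical names made up for illustration rather than anything taken from the API.

```python
def collect_images(emb_vals):
    """Return (title, image_key, src) tuples from an iterable of item dicts."""
    found = []
    for item in emb_vals:
        if not isinstance(item, dict):
            continue  # skip anything that is not a dictionary
        title = item.get('title')  # assumed field name; None if it is absent
        for key, value in item.items():
            # Keep only dictionary values stored under a key mentioning "image"
            if 'image' in key.lower() and isinstance(value, dict) and value.get('src'):
                found.append((title, key, value['src']))
    return found


# Made-up sample shaped like the structure described above
sample = [
    {'title': 'Show A', 'tileImage': {'src': 'https://example.com/a.jpg'}},
    {'title': 'Show B', 'coverImage': {'src': 'https://example.com/b.png'}, 'image': None},
]
for row in collect_images(sample):
    print(row)
```

With the real response you would pass embVals instead of sample, and the extra fields could then go into the filename.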

I don't know if it will get you ALL the images on the page, but when I tried it, urls had 297 links.
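With that many links, some may be repeats, and some may not end in the .jpg/.gif/.png suffix that the question's regex expects (I have not checked either against the real response, so treat both as assumptions). One option is to drop duplicates and take the filename from the URL path instead. filename_from_url and unique are hypothetical helpers, and the example URLs are made up.

```python
import os
from urllib.parse import urlparse, unquote


def filename_from_url(url, default='image'):
    """Derive a local filename from the last path segment of a URL."""
    # urlparse separates the path from any query string such as ?width=300
    path = unquote(urlparse(url).path)
    name = os.path.basename(path.rstrip('/'))
    return name or default


def unique(urls):
    """Drop duplicate URLs while keeping their original order."""
    return list(dict.fromkeys(urls))


example_urls = [
    'https://example.com/img/show-a.jpg?width=300',
    'https://example.com/img/show-b.png',
    'https://example.com/img/show-a.jpg?width=300',
]
for url in unique(example_urls):
    print(filename_from_url(url))
```

Each name could then be passed to open(name, 'wb') in the download loop, writing requests.get(url).content as before.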

Upvotes: 1
