Reputation: 175
I am trying to download the show images from this page with BeautifulSoup.
When I run the code below, the only image that downloads is the spinning loading icon.
When I check the requests tab for the page, I can see requests for all the other images, so I assume they should be downloaded as well. I am not sure why they are not, since they are contained within img tags in the page's HTML.
import re
import requests
from bs4 import BeautifulSoup

site = 'https://www.tvnz.co.nz/categories/sci-fi-and-fantasy'
response = requests.get(site)
soup = BeautifulSoup(response.text, 'html.parser')
image_tags = soup.find_all('img')
urls = [img['src'] for img in image_tags]
for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    if not filename:
        print("Regular expression didn't match with the url: {}".format(url))
        continue
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            url = '{}{}'.format(site, url)
        response = requests.get(url)
        f.write(response.content)
print("Download complete, downloaded images can be found in current directory!")
Upvotes: 0
Views: 294
Reputation: 4710
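The show images appear to be filled in by JavaScript after the initial HTML loads, which would explain why requests only sees the loading icon. You can try going through the API they seem to be using to populate the page: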
import requests

api_url = 'https://apis-edge-prod.tech.tvnz.co.nz/api/v1/web/play/page/categories/sci-fi-and-fantasy'
r = requests.get(api_url)
try:
    embVals = r.json()['_embedded'].values()
except Exception as e:
    embVals = []
    print('failed to get embedded items\n', str(e))

# collect the 'src' of every image-like entry in each embedded item;
# the isinstance check skips values that are not dictionaries (None, strings, ...)
urls = [
    v['src']
    for ev in embVals
    for k, v in ev.items()
    if k is not None and 'image' in k.lower()
    and isinstance(v, dict) and 'src' in v
]

# for url in urls: # the download loop from the question should work the same
(Images seem to be in nested dictionaries with keys like 'portraitTileImage', 'image', 'tileImage', 'coverImage'. You can also use for-loops to go through embVals and extract other data if you want to include more in the filename, metadata, etc.)
I don't know if it will get you ALL the images on the page, but when I tried it, urls had 297 links.
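As a rough sketch of the for-loop variant mentioned above, something like the following could keep the item key and the image key next to each link, so they are available for filenames or metadata. The function name `collect_images` and the `sample` dictionary are made up for illustration; the sample only imitates the nested layout described above and is not real API output. With the real response you would pass `r.json()['_embedded']` instead.

```python
def collect_images(embedded):
    """Return (item_id, image_key, src) tuples from a dict shaped like
    r.json()['_embedded'] (assumed layout: item id -> dict of fields)."""
    found = []
    for item_id, item in embedded.items():
        if not isinstance(item, dict):
            continue  # skip anything that is not a nested dictionary
        for key, value in item.items():
            # image entries are assumed to be dicts holding a 'src' link
            if 'image' in key.lower() and isinstance(value, dict) and 'src' in value:
                found.append((item_id, key, value['src']))
    return found


# Hypothetical sample data, invented for this example (not real API output)
sample = {
    '/shows/example-show': {
        'title': 'Example Show',
        'tileImage': {'src': 'https://example.com/img/tile.jpg'},
        'coverImage': {'src': 'https://example.com/img/cover.jpg'},
        'image': None,
    },
    '/shows/other-show': {
        'title': 'Other Show',
        'portraitTileImage': {'src': 'https://example.com/img/portrait.png'},
    },
}

images = collect_images(sample)
for item_id, key, src in images:
    print(item_id, key, src)
```

Keeping the image key alongside each link also makes it easy to download only one kind of image, for example just the 'tileImage' entries.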
Upvotes: 1