Reputation: 398
I am trying to get all the image urls for all the books on this page https://www.nb.co.za/en/books/0-6-years
with beautiful soup.
This is my code:
from bs4 import BeautifulSoup
import requests
baseurl = "https://www.nb.co.za/"
productlinks = []
r = requests.get('https://www.nb.co.za/en/books/0-6-years')
soup = BeautifulSoup(r.content, 'lxml')
productlist = soup.find_all('div', class_="book-slider-frame")
def my_filter(tag):
    return (tag.name == 'a' and
            tag.parent.name == 'div' and
            'img-container' in tag.parent['class'])
for item in productlist:
    for link in item.find_all(my_filter, href=True):
        productlinks.append(baseurl + link['href'])
cover = soup.find_all('div', class_="img-container")
print(cover)
And this is my output:
<div class="img-container">
<a href="/en/view-book/?id=9780798182539">
<img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>
</a>
</div>
What I hope to get:
https://www.nb.co.za/en/helper/ReadImage/25929.jpg
My problem is:
How do I get the data-src only?
How do I get the extension of the image?
Upvotes: 2
Views: 3145
Reputation: 1724
To wait until all images are loaded you can pass requests a timeout argument, or set timeout=None, which tells requests to wait indefinitely for a response if the page loads slowly.
The reason why you get a .gif at the end of the image results is that the image hasn't been loaded yet; the gif is a loading placeholder.
You can access the data-src attribute the same way you would access a dictionary: tag['attribute'].
If you want to save an image locally, you can use urllib.request.urlretrieve:
import urllib.request
urllib.request.urlretrieve("BOOK_COVER_URL", "file_name.jpg")  # will save in the current directory
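The dictionary-style attribute lookup can be seen on a minimal fragment (the HTML below is the img tag from the question):

```python
from bs4 import BeautifulSoup

html = '<img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>'
img = BeautifulSoup(html, "html.parser").img

print(img["data-src"])        # /en/helper/ReadImage/25929
print(img.get("data-other"))  # .get() returns None instead of raising KeyError
```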
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
response = requests.get('https://www.nb.co.za/en/books/0-6-years', timeout=None)
soup = BeautifulSoup(response.text, 'lxml')

for result in soup.select(".img-container"):
    link = f'https://www.nb.co.za{result.select_one("a")["href"]}'
    # try/except to handle results with no image on the website (last 3 results)
    try:
        image = f'https://www.nb.co.za{result.select_one("a img")["data-src"]}'
    except TypeError:
        image = None
    print(link, image, sep="\n")
# part of the output:
'''
# first result (Step by Step: Counting to 50)
https://www.nb.co.za/en/view-book/?id=9780798182539
https://www.nb.co.za/en/helper/ReadImage/25929
# last result WITH image preview (Dinosourusse - Feite en geite: Daar’s ’n trikeratops op die trampoline)
https://www.nb.co.za/en/view-book/?id=9780624035480
https://www.nb.co.za/en/helper/ReadImage/10853
# last result (Uhambo lukamusa (isiZulu)) WITH NO image preview on the website as well so it returned None
https://www.nb.co.za/en/view-book/?id=9780624043003
None
'''
Upvotes: -1
Reputation: 25196
1: How do I get the data-src only?
You can access the data-src
by calling element['data-src']
:
cover = baseurl+item.img['data-src'] if item.img['src'] != nocover else baseurl+nocover
2: How do I get the extension of the image?
You can get the extension of the file as diggusbickus mentioned (a good approach), but this will not help you if you try to request the file as https://www.nb.co.za/en/helper/ReadImage/25929.jpg: that will cause a 404 error.
The image is dynamically loaded / served; additional info: https://stackoverflow.com/a/5110673/14460824
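Since the ReadImage URL carries no extension, one way to recover it without saving the file first is to read the Content-Type header of the response. A sketch, assuming the server sends a correct Content-Type (the helper name below is hypothetical, not part of the original answer):

```python
import mimetypes

def extension_from_content_type(content_type):
    """Map an HTTP Content-Type header value to a file extension."""
    mime = content_type.split(";")[0].strip()  # drop parameters like "; charset=utf-8"
    return mimetypes.guess_extension(mime)

# Usage with requests (network call, shown only as a sketch):
# resp = requests.get("https://www.nb.co.za/en/helper/ReadImage/25929")
# ext = extension_from_content_type(resp.headers.get("Content-Type", ""))
```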
baseurl = "https://www.nb.co.za/"
nocover = '/Content/images/no-cover.jpg'
data = []
for item in soup.select('.book-slider-frame'):
    data.append({
        'link': baseurl+item.a['href'],
        'cover': baseurl+item.img['data-src'] if item.img['src'] != nocover else baseurl+nocover
    })
data
[{'link': 'https://www.nb.co.za//en/view-book/?id=9780798182539',
'cover': 'https://www.nb.co.za//en/helper/ReadImage/25929'},
{'link': 'https://www.nb.co.za//en/view-book/?id=9780798182546',
'cover': 'https://www.nb.co.za//en/helper/ReadImage/25931'},
{'link': 'https://www.nb.co.za//en/view-book/?id=9780798182553',
'cover': 'https://www.nb.co.za//en/helper/ReadImage/25925'},...]
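As an aside, the double slashes in the output come from concatenating baseurl (which ends in /) with hrefs that also start with /; urllib.parse.urljoin handles this cleanly:

```python
from urllib.parse import urljoin

baseurl = "https://www.nb.co.za/"
print(urljoin(baseurl, "/en/view-book/?id=9780798182539"))
# https://www.nb.co.za/en/view-book/?id=9780798182539
```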
Upvotes: 4
Reputation: 2010
I'll show you how to do it for that small example and let you handle the rest. Just use the imghdr
module:
import imghdr
import requests
from bs4 import BeautifulSoup
data = """<div class="img-container">
<a href="/en/view-book/?id=9780798182539">
<img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>
</a>
</div>"""

soup = BeautifulSoup(data, 'lxml')
base_url = "https://www.nb.co.za"
img_src = soup.select_one('img')['data-src']
img_name = img_src.split('/')[-1]

data = requests.get(base_url + img_src)
with open(img_name, 'wb') as f:
    f.write(data.content)

print(imghdr.what(img_name))
>>> jpeg
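To actually end up with a .jpg filename, the detected type can be turned into an extension and appended; a minimal sketch (filename_with_extension is a hypothetical helper, and note that imghdr is deprecated since Python 3.11):

```python
def filename_with_extension(path, kind):
    """Build a filename carrying the extension for a detected image type.
    `kind` is what imghdr.what returns, e.g. 'jpeg' or None."""
    if kind is None:
        return path                          # type unknown: keep the name as-is
    ext = "jpg" if kind == "jpeg" else kind  # imghdr reports 'jpeg'; the usual extension is .jpg
    return f"{path}.{ext}"

# e.g. os.rename(img_name, filename_with_extension(img_name, imghdr.what(img_name)))
```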
Upvotes: 1