Schalk Joubert

Reputation: 398

Get image data-src with Beautiful Soup when there is no image extension

I am trying to get all the image URLs for all the books on this page https://www.nb.co.za/en/books/0-6-years with Beautiful Soup.

This is my code:

from bs4 import BeautifulSoup
import requests

baseurl = "https://www.nb.co.za/"
productlinks = []

r = requests.get(f'https://www.nb.co.za/en/books/0-6-years')
soup = BeautifulSoup(r.content, 'lxml')
productlist = soup.find_all('div', class_="book-slider-frame")

def my_filter(tag):
    return (tag.name == 'a' and
        tag.parent.name == 'div' and
        'img-container' in tag.parent['class'])

for item in productlist:
    for link in item.find_all(my_filter, href=True):
        productlinks.append(baseurl + link['href'])

        cover = soup.find_all('div', class_="img-container")
        print(cover)

And this is my output:

<div class="img-container">
<a href="/en/view-book/?id=9780798182539">
<img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>
</a>
</div>

What I hope to get:

https://www.nb.co.za/en/helper/ReadImage/25929.jpg

My problem is:

  1. How do I get the data-src only?

  2. How do I get the extension of the image?

Upvotes: 2

Views: 3145

Answers (3)

Dmitriy Zub

Reputation: 1724

The timeout argument only controls how long requests waits for the server to respond (timeout=None means wait indefinitely on a slow page); it does not make the browser-side images load.

The reason you get a .gif at the end of the image results is that the cover is lazy-loaded in the browser: the .gif is just a loading placeholder, while the real URL is already present in the data-src attribute of the static HTML.

You can access the data-src attribute the same way you would access a dictionary: tag["data-src"]
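As a minimal, self-contained sketch of that dictionary-style access (using the `<img>` tag from the question's own output):

```python
from bs4 import BeautifulSoup

html = '<img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>'
img = BeautifulSoup(html, "html.parser").find("img")

# Subscripting the tag works like a dictionary lookup:
print(img["data-src"])           # /en/helper/ReadImage/25929

# .get() avoids a KeyError when the attribute is missing:
print(img.get("data-original"))  # None
```

Using `.get()` is handy here because the last few books on the page have no cover image at all.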


If you want to save an image locally, you can use urllib.request.urlretrieve:

import urllib.request

urllib.request.urlretrieve("BOOK_COVER_URL", "file_name.jpg")  # saves in the current directory

Code and example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml

response = requests.get('https://www.nb.co.za/en/books/0-6-years', timeout=None)
soup = BeautifulSoup(response.text, 'lxml')

for result in soup.select(".img-container"):
    link = f'https://www.nb.co.za{result.select_one("a")["href"]}'

    # try/except to handle error when there's no image on the website (last 3 results)
    try:
        image = f'https://www.nb.co.za{result.select_one("a img")["data-src"]}'
    except (TypeError, KeyError):
        image = None

    print(link, image, sep="\n")


# part of the output:
'''
# first result (Step by Step: Counting to 50)
https://www.nb.co.za/en/view-book/?id=9780798182539
https://www.nb.co.za/en/helper/ReadImage/25929

# last result WITH image preview (Dinosourusse - Feite en geite: Daar’s ’n trikeratops op die trampoline)
https://www.nb.co.za/en/view-book/?id=9780624035480
https://www.nb.co.za/en/helper/ReadImage/10853

# last result (Uhambo lukamusa (isiZulu)) WITH NO image preview on the website as well so it returned None
https://www.nb.co.za/en/view-book/?id=9780624043003
None
'''

Upvotes: -1

HedgeHog

Reputation: 25196

1: How do I get the data-src only?

You can access the data-src by calling element['data-src']:

cover = baseurl+item.img['data-src'] if item.img['src'] != nocover else baseurl+nocover

2: How do I get the extension of the image?

You can get the extension of the file as diggusbickus mentioned (a good approach), but this will not help you if you try to request a file like https://www.nb.co.za/en/helper/ReadImage/25929.jpg, since that request returns a 404 error.

The image is dynamically loaded / served; for additional info see https://stackoverflow.com/a/5110673/14460824
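Because the server decides the format, one way to recover an extension without guessing is to read the Content-Type response header and map it with the standard mimetypes module. A minimal sketch (extension_from_response is a made-up helper name, not part of any library):

```python
import mimetypes
import requests

def extension_from_response(url):
    """Fetch the image and derive a file extension from the
    Content-Type header, since the URL itself has no extension."""
    resp = requests.get(url, timeout=10)
    # Strip any "; charset=..." suffix before mapping
    content_type = resp.headers.get("Content-Type", "").split(";")[0]
    return mimetypes.guess_extension(content_type)

# e.g. extension_from_response("https://www.nb.co.za/en/helper/ReadImage/25929")
# a Content-Type of "image/jpeg" maps to a .jpg/.jpeg-style extension
```

This avoids downloading the file twice: you already have resp.content if you want to save it under the derived name.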

Example

baseurl = "https://www.nb.co.za"
nocover = '/Content/images/no-cover.jpg'
data = []

for item in soup.select('.book-slider-frame'):

    data.append({
        'link' : baseurl+item.a['href'],
        'cover' : baseurl+item.img['data-src'] if item.img['src'] != nocover else baseurl+nocover
    })

data

Output

[{'link': 'https://www.nb.co.za/en/view-book/?id=9780798182539',
  'cover': 'https://www.nb.co.za/en/helper/ReadImage/25929'},
 {'link': 'https://www.nb.co.za/en/view-book/?id=9780798182546',
  'cover': 'https://www.nb.co.za/en/helper/ReadImage/25931'},
 {'link': 'https://www.nb.co.za/en/view-book/?id=9780798182553',
  'cover': 'https://www.nb.co.za/en/helper/ReadImage/25925'}, ...]

Upvotes: 4

folen gateis

Reputation: 2010

I'll show you how to do it for that small example and let you handle the rest: just use the imghdr module.

import imghdr

import requests
from bs4 import BeautifulSoup

html = """<div class="img-container">
<a href="/en/view-book/?id=9780798182539">
<img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>
</a>
</div>"""
soup = BeautifulSoup(html, 'lxml')
base_url = "https://www.nb.co.za"
img_src = soup.select_one('img')['data-src']
img_name = img_src.split('/')[-1]
response = requests.get(base_url + img_src)
with open(img_name, 'wb') as f:
    f.write(response.content)

print(imghdr.what(img_name))
>>> jpeg
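One caveat: imghdr was deprecated in Python 3.11 and removed in 3.13. A minimal, dependency-free alternative (a sketch; sniff_image_type is a made-up helper that only covers a few common formats) is to check the file's magic bytes yourself:

```python
def sniff_image_type(data):
    """Identify common image formats from their leading magic bytes."""
    if data.startswith(b"\xff\xd8\xff"):            # JPEG
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):       # PNG
        return "png"
    if data[:6] in (b"GIF87a", b"GIF89a"):          # GIF
        return "gif"
    return None

# Only the first few bytes of the file are needed:
print(sniff_image_type(b"\xff\xd8\xff\xe0" + b"\x00" * 12))  # jpeg
```

You could call this on `response.content[:16]` before writing the file, and pick the extension from the returned name.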

Upvotes: 1
