geoabram

Reputation: 125

Get image URL with BeautifulSoup where src=data:image/gif;base64,

I am trying to get the URLs of the images in a webpage using Python and BeautifulSoup4.

My current code is:


import requests

from bs4 import BeautifulSoup

url="https://goibibo.com/hotels/hotels-in-shimla-ct/"

#Headers

headers={
    'User-Agent':"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"
}

data = requests.get(url,headers=headers).text
soup = BeautifulSoup(data, 'html.parser')

images = soup.find_all('img',src=True)

print('Number of Images: ', len(images))
print('\n')
for image in images:
    if(image.has_attr('src')):
        print(image['src'])

When I inspect the image element in the browser, it has a proper URL (src="https://cdn1.goibibo.com/voy_ing/t_g/812aa1726b8211e7a0a10a4cef95d023.jpg"). However, when I read the src value of the img element with BeautifulSoup4, I get data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7 instead.

How can I get the image URL shown in the webpage?

Upvotes: 2

Views: 5860

Answers (1)

Alexandra Dudkina

Reputation: 4472

The src of this img tag does not reference an image URL; it contains the image itself, base64-encoded as a data URI (see w3docs for an example).

To decode it, first take the part of the string after base64,:

string = string.split('base64,')[1]

Then decode it to a byte array:

import base64
decoded = base64.decodebytes(string.encode("ascii"))

That byte array can then be written to a file:

with open('output.gif', 'wb') as f:
    f.write(decoded)

In general it is a bit more involved, because you also need to take into account the picture format given at the beginning of the data URI (data:image/gif; here, but it can be png, jpeg or svg+xml as well) and choose the file extension accordingly. That shouldn't be very complicated either.
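As a rough sketch of handling the format as well, the snippet below splits the data URI into its MIME type and payload and picks a file extension from the MIME type. The helper names decode_data_uri and extension_for and the EXTENSIONS table are made up for this illustration, and only base64-encoded data URIs are handled.

```python
import base64

# Hypothetical lookup table from MIME subtype to file extension; extend as needed.
EXTENSIONS = {"gif": ".gif", "png": ".png", "jpeg": ".jpg", "svg+xml": ".svg"}


def decode_data_uri(uri):
    """Split a base64 data URI into (mime_type, raw_bytes)."""
    header, _, payload = uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    # The header looks like 'data:image/gif;base64'; optional parameters
    # such as ';charset=utf-8' are dropped here.
    mime_type = header[len("data:"):].split(";")[0]
    return mime_type, base64.b64decode(payload)


def extension_for(mime_type):
    """Pick a file extension from a MIME type such as 'image/gif'."""
    subtype = mime_type.partition("/")[2]
    return EXTENSIONS.get(subtype, ".bin")  # fall back to a generic extension


# The src value from the question.
src = "data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"

mime_type, decoded = decode_data_uri(src)
filename = "output" + extension_for(mime_type)
with open(filename, "wb") as f:
    f.write(decoded)
```

Note that for the string from the question the decoded bytes are a 42-byte GIF whose header declares a 1x1 pixel image, so it is likely a placeholder rather than the photo seen in the browser inspector.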

Upvotes: 3
