Reputation: 125
I am trying to get the URLs of the images in a webpage using Python and BeautifulSoup4.
My current code is:
import requests
from bs4 import BeautifulSoup
url="https://goibibo.com/hotels/hotels-in-shimla-ct/"
#Headers
headers={
    'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"
}
data = requests.get(url,headers=headers).text
soup = BeautifulSoup(data, 'html.parser')
images = soup.find_all('img',src=True)
print('Number of Images: ', len(images))
print('\n')
for image in images:
    if image.has_attr('src'):
        print(image['src'])
When I inspect the image element it has a proper URL (src="https://cdn1.goibibo.com/voy_ing/t_g/812aa1726b8211e7a0a10a4cef95d023.jpg"). However, when I get the src value of the img element using BeautifulSoup4, it returns data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
How can I get the image url given in the webpage?
Upvotes: 2
Views: 5860
Reputation: 4472
This img tag does not reference an image URL; its src is a data URI that contains the image itself, base64-encoded (see w3docs for an example).
To decode it, first take the part of the string that comes after "base64,":
string = string.split('base64,')[1]
Then decode it to bytes:
import base64
decoded = base64.decodebytes(string.encode("ascii"))
Those bytes can then be written to a file:
with open('output.gif', 'wb') as f:
    f.write(decoded)
A general solution is slightly more involved, because the image format is declared at the start of the data URI (data:image/gif; here, but it can also be png, jpeg, or svg), so the output file extension should be chosen to match. That is not hard to handle either.
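The steps above, including the format check, could be combined roughly as follows. This is only a sketch: the helper names decode_data_uri and save_data_uri, the EXTENSIONS table, and the ".bin" fallback are illustrative choices, not part of any library, and the sample URI is the one from the question.

```python
import base64

# Hypothetical lookup table from MIME type to file extension (extend as needed).
EXTENSIONS = {
    "image/gif": ".gif",
    "image/png": ".png",
    "image/jpeg": ".jpg",
    "image/svg+xml": ".svg",
}


def decode_data_uri(uri):
    """Split a base64 data URI into (mime_type, raw_bytes)."""
    header, _, payload = uri.partition(",")
    if not header.startswith("data:"):
        raise ValueError("not a data URI")
    # The header looks like "data:image/gif;base64" (extra parameters may appear).
    parts = header[len("data:"):].split(";")
    if "base64" not in parts[1:]:
        raise ValueError("data URI is not base64-encoded")
    return parts[0], base64.b64decode(payload)


def save_data_uri(uri, stem="output"):
    """Write the decoded image to '<stem><ext>' and return the file name."""
    mime_type, raw = decode_data_uri(uri)
    # Fall back to ".bin" for formats missing from the table (an assumption).
    filename = stem + EXTENSIONS.get(mime_type, ".bin")
    with open(filename, "wb") as f:
        f.write(raw)
    return filename


# The src value reported in the question.
src = ("data:image/gif;base64,"
       "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")
mime_type, raw = decode_data_uri(src)

if __name__ == "__main__":
    print(mime_type, len(raw), "bytes")
    print("saved to", save_data_uri(src))
```

In a scraping loop, save_data_uri would only be called for src values that start with "data:"; anything else is an ordinary URL and can be downloaded as usual.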
Upvotes: 3