Reputation: 1142
I'm trying to scrape Google Images. Beautiful Soup extracts the 'src' attribute, but it outputs links such as
data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
which is not the actual image. The script tag looks heavily encoded and doesn't seem to contain the actual URI. Can anybody suggest a solution?
This is actually a minimal data URI which, when decoded, yields a 1x1 image. My question is how Google ends up serving this minimal data URI instead of the full one, and how we can access the full URI to get the actual image.
Upvotes: 0
Views: 699
Reputation: 1424
Google Images results are inserted into the DOM from (thankfully) inline JavaScript. Open the page source of the search results for any query, copy an image's src attribute, and find it in the page source.
To extract it with bs4 only, you can mimic the browser and extract the data from the inline JavaScript with regular expressions.
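A minimal sketch of the regular-expression approach, assuming the full-size URLs appear as quoted strings inside inline script blocks. The sample markup, the example.com URLs, and the names SCRIPT_RE, URL_RE, and extract_image_urls are made up for illustration; Google's real page structure differs and changes often, so the patterns would likely need adjusting. With bs4 the script text could instead come from soup.find_all("script").

```python
import re

# Hypothetical sample markup; real Google results pages look different.
SAMPLE_HTML = r'''
<img src="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==">
<script>var data = [["https://example.com/images/coffee.jpg",800,600],
["https://example.com/images/latte\u003dlarge.png",640,480]];</script>
'''

# Grab the body of every inline <script> element.
SCRIPT_RE = re.compile(r"<script[^>]*>(.*?)</script>", re.DOTALL | re.IGNORECASE)
# Match quoted http(s) URLs that end in a common image extension (an assumption).
URL_RE = re.compile(r'"(https?://[^"]+?\.(?:jpg|jpeg|png|gif|webp))"', re.IGNORECASE)


def extract_image_urls(html):
    """Return unique image URLs found in inline scripts, in order of appearance."""
    urls = []
    for script in SCRIPT_RE.findall(html):
        for raw in URL_RE.findall(script):
            # Inline JavaScript often escapes characters as \uXXXX; decode them.
            url = re.sub(
                r"\\u([0-9a-fA-F]{4})",
                lambda m: chr(int(m.group(1), 16)),
                raw,
            )
            if url not in urls:
                urls.append(url)
    return urls


if __name__ == "__main__":
    for url in extract_image_urls(SAMPLE_HTML):
        print(url)
```

The placeholder data URI in the img tag is ignored because only script contents are searched.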
Alternatively, you can use SerpApi to extract URIs of full images. It's a paid SaaS with a free trial.
Example usage with curl:
curl -s 'https://serpapi.com/search?q=coffee&tbm=isch'
Example usage with the google-search-results Python package on Repl.it:
from serpapi import GoogleSearch
import os
params = {
"engine": "google",
"q": "coffee",
"tbm": "isch",
"api_key": os.getenv("API_KEY")
}
client = GoogleSearch(params)
data = client.get_dict()
print("Images results")
for result in data['images_results']:
    print(f"""
Position: {result['position']}
Original image: {result['original']}
""")
Example output
Images results
Position: 1
Original image: https://upload.wikimedia.org/wikipedia/commons/4/45/A_small_cup_of_coffee.JPG
Position: 2
Original image: https://media3.s-nbcnews.com/j/newscms/2019_33/2203981/171026-better-coffee-boost-se-329p_67dfb6820f7d3898b5486975903c2e51.fit-1240w.jpg
See the documentation for the Google Images API on the SerpApi website.
Disclaimer: I work at SerpApi.
Upvotes: 0
Reputation: 15498
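That's the image in Base64 encoding. You can save it to an image file like this:
import base64
# Placeholder: use the part of the data URI that follows "base64,"
src = "BASE64 DATA"
# Decode the Base64 text to bytes and write them in binary mode
with open("MyImage.gif", "wb") as img:
    img.write(base64.b64decode(src))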
Upvotes: 1
Reputation: 56
This is a data URL; please refer to https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URIs
You can decode the Base64 string and then save it to an image file.
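As a rough sketch of that decoding step, the snippet below splits the data URL from the question into its media type and payload, decodes the payload, and reads the width and height from the GIF header. The helper name decode_data_uri is made up for illustration, and it only handles the ";base64" form of data URLs.

```python
import base64
import struct


def decode_data_uri(uri):
    """Split a base64 data URL into (media_type, raw_bytes)."""
    header, _, payload = uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URL")
    media_type = header[len("data:"):-len(";base64")]
    return media_type, base64.b64decode(payload)


# The placeholder src value from the question.
PLACEHOLDER = (
    "data:image/gif;base64,"
    "R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="
)

media_type, data = decode_data_uri(PLACEHOLDER)
# A GIF stores width and height as little-endian 16-bit integers
# right after the 6-byte signature.
width, height = struct.unpack("<HH", data[6:10])

if __name__ == "__main__":
    print(media_type, data[:6], width, height)
    # To keep the image, write the bytes in binary mode, for example:
    # with open("placeholder.gif", "wb") as f:
    #     f.write(data)
```

The decoded bytes are only the 1x1 placeholder, so this confirms what the src attribute contains but does not recover the full-size image.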
Upvotes: 1