Chiranjeevi Kandel

Reputation: 1142

Scrape src attribute from google with beautiful soup only

I'm trying to scrape Google Images. Beautiful Soup extracts the 'src' attribute, but the value it returns is a link like

data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw== 

which is not the actual image. The script tag looks heavily encoded and doesn't seem to contain the actual URI. Can anybody suggest a solution?

This is actually a minified data URI which, when decoded, yields a 1x1 image. My question is: how does Google minify the complete data URI, and how can we access the full URI so that we can get the actual image?

Upvotes: 0

Views: 699

Answers (3)

ilyazub

Reputation: 1424

Google inserts the images into the DOM from (thankfully) inline JavaScript. To see this, open the search results for any query, copy an image's src attribute, and search for it in the page source.

To extract the images with bs4 only, you can mimic the browser: fetch the page, then pull the data out of the inline JavaScript with regular expressions.

[Screenshot: page source of Google Images results for the "stackoverflow" search query]
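As a rough sketch of the regular-expression approach, assuming the full-size image URLs appear as double-quoted string literals inside inline script tags. The sample HTML, the example.com URLs, the IMAGE_URL pattern, and the helper names below are all hypothetical; the real page layout may differ and can change at any time.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded results page (not real Google markup).
SAMPLE_HTML = r"""
<html><body>
<img src="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==">
<script src="https://example.com/external.js"></script>
<script>var data = ["https://example.com/photos/coffee.jpg", 800, 600,
 "https://example.com/img/cup.png?w\u003d640"];</script>
</body></html>
"""

# Assumed pattern: a quoted http(s) URL that contains a common image extension.
IMAGE_URL = re.compile(
    r'"(https?://[^"]+?\.(?:jpe?g|png|gif|webp)[^"]*)"', re.IGNORECASE
)


def unescape_js(text):
    """Turn JavaScript \\uXXXX escapes (e.g. \\u003d) into plain characters."""
    return re.sub(
        r"\\u([0-9a-fA-F]{4})", lambda m: chr(int(m.group(1), 16)), text
    )


def extract_image_urls(html):
    """Collect image URLs found in inline <script> tags, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for script in soup.find_all("script"):
        if script.get("src"):
            continue  # external script: nothing inline to search
        text = script.string or ""
        urls.extend(unescape_js(m.group(1)) for m in IMAGE_URL.finditer(text))
    # Drop duplicates while keeping the first occurrence of each URL.
    return list(dict.fromkeys(urls))


if __name__ == "__main__":
    for url in extract_image_urls(SAMPLE_HTML):
        print(url)
```

For a real page, pass the downloaded HTML in place of SAMPLE_HTML and adjust the pattern to whatever the page source actually contains.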

Alternatively, you can use SerpApi to extract URIs of full images. It's a paid SaaS with a free trial.

Example usage with curl.

curl -s 'https://serpapi.com/search?q=coffee&tbm=isch'

Example usage with google-search-results Python package on Repl.it.

from serpapi import GoogleSearch
import os

params = {
    "engine": "google",
    "q": "coffee",
    "tbm": "isch",
    "api_key": os.getenv("API_KEY")
}

client = GoogleSearch(params)
data = client.get_dict()

print("Images results")

for result in data['images_results']:
    print(f"""
Position: {result['position']}
Original image: {result['original']}
""")

Example output

Images results

Position: 1
Original image: https://upload.wikimedia.org/wikipedia/commons/4/45/A_small_cup_of_coffee.JPG


Position: 2
Original image: https://media3.s-nbcnews.com/j/newscms/2019_33/2203981/171026-better-coffee-boost-se-329p_67dfb6820f7d3898b5486975903c2e51.fit-1240w.jpg

See the Google Images API documentation on the SerpApi website.

Disclaimer: I work at SerpApi.

Upvotes: 0

wasif

Reputation: 15498

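That's the image in Base64 encoding. You can save it to an image file like this:

import base64

# Placeholder: put the part of the data URI that follows "base64," here.
src = "BASE64 DATA"
# str.decode('base64') only exists in Python 2; base64.b64decode works in Python 3.
with open("MyImage.gif", "wb") as img:
    img.write(base64.b64decode(src))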

Upvotes: 1

David Chen

Reputation: 56

This is a data URL; see https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URIs for details.

You can decode the Base64 string and then save it to an image file.
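A minimal sketch of that decoding step, using the data URI from the question. The helper names decode_data_uri and gif_size are made up for illustration, and the parser only handles the simple "data:<mime>;base64,<payload>" form. It also reads the width and height from the GIF header, which shows that this particular URI holds only a 1x1 placeholder and not the real image.

```python
import base64
import struct

# The data URI quoted in the question.
DATA_URI = (
    "data:image/gif;base64,"
    "R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="
)


def decode_data_uri(uri):
    """Split a base64 data URI into (mime_type, raw_bytes)."""
    header, _, payload = uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64-encoded data URI")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(payload)


def gif_size(data):
    """Read (width, height) from the GIF logical screen descriptor."""
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF file")
    # Bytes 6-9 hold width and height as little-endian unsigned 16-bit values.
    return struct.unpack("<HH", data[6:10])


if __name__ == "__main__":
    mime_type, data = decode_data_uri(DATA_URI)
    print(mime_type, gif_size(data))
```

Writing the decoded bytes to a file opened in "wb" mode produces a valid GIF, but it is only the placeholder pixel.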

Upvotes: 1
