Reputation: 135

Trying to extract the source link of the first image of a google search using beautifulsoup4

I am trying to write a program that returns the image link for the first image of a Google search.

The link I want is the one you get by clicking the first image, right-clicking the image that appears, and then opening the image. My current code is:

r = requests.get(theurl)
soup = BeautifulSoup(r.text,"lxml")
link = soup.find('img', class_='irc_mi')['src']
return link

However, I get a type error: "TypeError: 'NoneType' object is not subscriptable".

Upvotes: 0

Answers (3)

Dmitriy Zub

Reputation: 1724

You can achieve this using Selenium, but the execution time will be slower than with bs4.

To scrape the original image links using bs4, you need to pull the URLs out of the <script> tags with a regex and then decode them.

For example, part of the code (check out the full example in the online IDE):

import re

# find all script tags
all_script_tags = soup.select('script')

# find all full res images; re.findall needs a string, so stringify the tags first
matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                    str(all_script_tags))

# iterate over found matches and decode each one twice
for fixed_full_res_image in matched_google_full_resolution_images:
    original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
    original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
    print(original_size_img)

-----
'''
https://external-preview.redd.it/mAQWN2kUYgFS3fgm6LfYo37AY7i2e_YY8d83_1jTeys.jpg?auto=webp&s=b2bad0e23cbd83426b06e6a547ef32ebbc08e2d2
https://i.ytimg.com/vi/_mR0JBLXRLY/maxresdefault.jpg
https://wallpaperaccess.com/full/37454.jpg
...
'''
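
To make the regex and the double decode easier to follow, here is a small self-contained sketch that runs without a network request. The SAMPLE_HTML string, the example.com URLs, and the extract_full_res_links helper are made up for illustration. The regex is applied to the page text directly instead of to the stringified script tags. Real Google markup is much larger and can change at any time, so treat this as a sketch and not a guaranteed parser.

```python
import re

# Hypothetical stand-in for r.text. Note the literal "\\u003d" sequences:
# the URL is escaped twice in the page source, hence the two decode passes.
SAMPLE_HTML = r'''<script>AF_initDataCallback({data:[null,,["https://example.com/full/first.jpg?w\\u003d1920",1920,1080],,["https://example.com/full/second.jpg",1280,720]]});</script>'''

# Same pattern as in the snippet above: a quoted URL followed by two integers.
FULL_RES_PATTERN = r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]"


def extract_full_res_links(html_text):
    """Return every matched URL after two unicode-escape decode passes."""
    links = []
    for raw in re.findall(FULL_RES_PATTERN, html_text):
        once = bytes(raw, "ascii").decode("unicode-escape")
        links.append(bytes(once, "ascii").decode("unicode-escape"))
    return links


links = extract_full_res_links(SAMPLE_HTML)
# The question only asks for the first image, so guard against an empty list.
first_link = links[0] if links else None
print(first_link)
```

Taking links[0] only after checking that the list is non-empty avoids the same kind of error as in the question when nothing matches.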

Alternatively, you can achieve this easily by using the Google Images API from SerpApi. It's a paid API with a free plan.

The difference is that you don't need to figure out how to scrape anything or maintain the parser when something changes over time. All you need to do is iterate over structured JSON and extract the data you need.

Code to integrate:

import os, json
from serpapi import GoogleSearch

params = {
  "api_key": os.getenv("API_KEY"),
  "engine": "google",
  "q": "minecraft shaders 8k photo",
  "tbm": "isch"
}

search = GoogleSearch(params)
results = search.get_dict()

print(json.dumps(results['suggested_searches'], indent=2, ensure_ascii=False))
print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))

------
'''
[
...
  {
    "position": 30,
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQ_CjA8J1P5Y6bN2KCuY6XgS4mFvctuwhho6A&usqp=CAU",
    "source": "wallpaperbetter.com",
    "title": "minecraft shaders video games, HD wallpaper | Wallpaperbetter",
    "link": "https://www.wallpaperbetter.com/en/hd-wallpaper-cusnk",
    "original": "https://p4.wallpaperbetter.com/wallpaper/120/342/446/minecraft-shaders-video-games-wallpaper-preview.jpg",
    "is_product": false
  }
...
]
'''
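
Since the question only asks for the first image, a rough sketch of that last step could look like the following. It assumes the response keeps the shape shown in the sample output above ("images_results" as a list of objects with an "original" key). The results dict and the first_original_link helper are hypothetical and use placeholder example.com URLs, so no API call is made here.

```python
# Hypothetical response, trimmed to the fields shown in the sample output.
results = {
    "images_results": [
        {
            "position": 1,
            "link": "https://example.com/page-1",
            "original": "https://example.com/full/first.jpg",
        },
        {
            "position": 2,
            "link": "https://example.com/page-2",
            "original": "https://example.com/full/second.jpg",
        },
    ]
}


def first_original_link(results):
    """Return the "original" URL of the first result that has one, else None."""
    for image in results.get("images_results", []):
        if image.get("original"):
            return image["original"]
    return None


print(first_original_link(results))
```

Using .get() with defaults means a response without any image results yields None instead of raising a KeyError.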

I have already answered a similar question here and wrote a dedicated blog post about how to scrape and download Google Images with Python.

Disclaimer: I work for SerpApi.

Upvotes: 1

radzak

Reputation: 3118

It appears that the src attributes are added by JavaScript running in the browser, so they are missing from the HTML that requests downloads. You can use Requests-HTML to achieve your goal:

from requests_html import HTMLSession

session = HTMLSession()
url = 'https://www.google.pl/search?q=python&source=lnms&tbm=isch&sa=X&ved=0ahUKEwif6Zq7i8vaAhVMLVAKHUDkDa4Q_AUICigB&biw=1280&bih=681'
r = session.get(url)
r.html.render()

first_image = r.html.find('.rg_ic.rg_i', first=True)
link = first_image.attrs['src']

Upvotes: 2

Fraser

Reputation: 17094

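Note that the BeautifulSoup keyword is class_ (with a trailing underscore), not _class or class.

Also, you don't actually need to pass the class name as a keyword argument. A string given as the second positional argument is matched against the CSS class.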

r = requests.get(theurl)
soup = BeautifulSoup(r.text, "lxml")
link = soup.find("img", "irc_mi")["src"]
return link
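
Either way, find() returns None when nothing matches, and indexing None with ["src"] is what raises the TypeError from the question. Below is a minimal sketch of a guard, assuming the downloaded HTML simply does not contain an img with class irc_mi. The static_html string and the first_image_src helper are made up for illustration, and "html.parser" is used so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for r.text: there is an <img>, but none
# with the class "irc_mi".
static_html = (
    '<html><body>'
    '<img class="thumb" src="https://example.com/t.jpg">'
    '</body></html>'
)


def first_image_src(html_text, css_class):
    """Return the src of the first <img> with the given class, or None."""
    soup = BeautifulSoup(html_text, "html.parser")
    img = soup.find("img", class_=css_class)
    if img is None:
        # No matching tag: return None instead of subscripting None.
        return None
    return img.get("src")


print(first_image_src(static_html, "irc_mi"))
print(first_image_src(static_html, "thumb"))
```

If the first call returns None for your real page, the element is not in the HTML you downloaded, and changing how the class is passed will not help.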

Upvotes: 0
