Reputation: 135
I am trying to write a program that returns the image link for the first image of a Google search.
The link I want is the one you get by clicking the first image, right-clicking the image that appears, and then opening the image. My current code is:
r = requests.get(theurl)
soup = BeautifulSoup(r.text,"lxml")
link = soup.find('img', class_='irc_mi')['src']
return link
However, I get a type error that says "TypeError: 'NoneType' object is not subscriptable".
Upvotes: 0
Views: 4321
Reputation: 1724
You can achieve this using selenium, but the execution time will be slower than using bs4.
To scrape the original image link using bs4, you need to match the contents of the <script> tags against a regex and then decode the matched links.
For example, part of the code (check out full example in the online IDE):
import re

# find all script tags (soup is the BeautifulSoup object for the results page)
all_script_tags = soup.select('script')
# find all full res images; re.findall needs a string, so convert the tag list first
matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                   str(all_script_tags))
# iterate over found matches and decode them
for fixed_full_res_image in matched_google_full_resolution_images:
    # the URLs can be escaped twice in the page source, so decode twice
    original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
    original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
    print(original_size_img)
-----
'''
https://external-preview.redd.it/mAQWN2kUYgFS3fgm6LfYo37AY7i2e_YY8d83_1jTeys.jpg?auto=webp&s=b2bad0e23cbd83426b06e6a547ef32ebbc08e2d2
https://i.ytimg.com/vi/_mR0JBLXRLY/maxresdefault.jpg
https://wallpaperaccess.com/full/37454.jpg
...
'''
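To make the regex and decoding steps easier to follow, here is a minimal, self-contained sketch that runs the same pattern on a hand-written string instead of a live Google page. The sample text, the example.com URLs, and the helper name extract_full_res_urls are assumptions made up for illustration; real page source may be laid out differently and can change at any time.

```python
import re

# Same pattern as in the snippet above.
FULL_RES_PATTERN = r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]"

# Hypothetical fragment of inline script text (example.com URLs are made up).
# Raw strings keep the backslashes literal, as they would appear in page source.
sample_script_text = (
    r'x,,["https://example.com/full/1.jpg?auto\u003dwebp\u0026s\u003dabc",1080,1920]'
    r',,["https://example.com/full/2.jpg",720,1280]'
)


def extract_full_res_urls(script_text):
    """Return decoded URLs matched by FULL_RES_PATTERN (hypothetical helper)."""
    urls = []
    for raw in re.findall(FULL_RES_PATTERN, script_text):
        # Decode twice, as in the answer, in case the URL was escaped twice.
        once = bytes(raw, "ascii").decode("unicode-escape")
        urls.append(bytes(once, "ascii").decode("unicode-escape"))
    return urls


for url in extract_full_res_urls(sample_script_text):
    print(url)
```

The first printed URL has its \u003d and \u0026 sequences turned back into = and &, which is the reason for the decode step.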
Alternatively, you can achieve this easily by using the Google Images API from SerpApi. It's a paid API with a free plan.
The difference is that you don't need to figure out how to scrape the page or maintain the parser when something changes over time. All you need to do is iterate over the structured JSON and extract the data you need.
Code to integrate:
import os, json
from serpapi import GoogleSearch

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google",
    "q": "minecraft shaders 8k photo",
    "tbm": "isch"
}

search = GoogleSearch(params)
results = search.get_dict()

print(json.dumps(results['suggested_searches'], indent=2, ensure_ascii=False))
print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))
------
'''
[
...
{
"position": 30,
"thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQ_CjA8J1P5Y6bN2KCuY6XgS4mFvctuwhho6A&usqp=CAU",
"source": "wallpaperbetter.com",
"title": "minecraft shaders video games, HD wallpaper | Wallpaperbetter",
"link": "https://www.wallpaperbetter.com/en/hd-wallpaper-cusnk",
"original": "https://p4.wallpaperbetter.com/wallpaper/120/342/446/minecraft-shaders-video-games-wallpaper-preview.jpg",
"is_product": false
}
...
]
'''
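Since the question asks only for the first image link, the JSON above could be reduced to a single URL roughly as follows. This is a sketch that assumes the response keeps the images_results list and the original key shown in the output; the helper name first_original_image and the trimmed sample dictionary are made up for illustration, and no API call is made.

```python
def first_original_image(results):
    """Return the 'original' URL of the first image result, or None if absent."""
    for image in results.get("images_results", []):
        original = image.get("original")
        if original:
            return original
    return None


# Trimmed stand-in for search.get_dict(), built from the output shown above.
sample_results = {
    "images_results": [
        {
            "position": 30,
            "source": "wallpaperbetter.com",
            "original": "https://p4.wallpaperbetter.com/wallpaper/120/342/446/minecraft-shaders-video-games-wallpaper-preview.jpg",
        }
    ]
}

print(first_original_image(sample_results))
```

Using .get() instead of indexing means a response without image results yields None rather than raising a KeyError.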
I have already answered a similar question here and wrote a dedicated blog post about how to scrape and download Google Images with Python.
Disclaimer: I work for SerpApi.
Upvotes: 1
Reputation: 3118
It appears that the src attributes are added by JavaScript running in the browser. You can use Requests-HTML to achieve your goal:
from requests_html import HTMLSession
session = HTMLSession()
url = 'https://www.google.pl/search?q=python&source=lnms&tbm=isch&sa=X&ved=0ahUKEwif6Zq7i8vaAhVMLVAKHUDkDa4Q_AUICigB&biw=1280&bih=681'
r = session.get(url)
r.html.render()
first_image = r.html.find('.rg_ic.rg_i', first=True)
link = first_image.attrs['src']
Upvotes: 2
Reputation: 17094
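Note that the keyword argument is class_ (with a trailing underscore), not _class or class, because class is a reserved word in Python.
Also, you don't actually need to pass the class as a keyword: a string given as the second positional argument to find() is matched against the CSS class.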
r = requests.get(theurl)
soup = BeautifulSoup(r.text, "lxml")
link = soup.find("img", "irc_mi")["src"]
return link
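Either spelling still raises the TypeError from the question whenever find() returns None, which happens when no matching img tag exists in the HTML that was downloaded. A minimal sketch of guarding against that, using two hand-written HTML strings rather than a real Google response (the helper name first_image_src and both sample documents are assumptions for illustration):

```python
from bs4 import BeautifulSoup


def first_image_src(html, css_class="irc_mi"):
    """Return the src of the first <img> with the given class, or None."""
    soup = BeautifulSoup(html, "html.parser")
    img = soup.find("img", class_=css_class)
    if img is None:
        # Subscripting None is what triggers
        # "TypeError: 'NoneType' object is not subscriptable".
        return None
    return img.get("src")


# Made-up stand-in for HTML without the target element.
static_html = '<html><body><img class="thumb" src="t.jpg"></body></html>'
# Made-up stand-in for HTML where the target element is present.
rendered_html = (
    '<html><body><img class="irc_mi" '
    'src="https://example.com/full.jpg"></body></html>'
)

print(first_image_src(static_html))
print(first_image_src(rendered_html))
```

Checking for None before subscripting turns the crash into a value the caller can test, which makes it easier to tell a missing element apart from a wrong selector.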
Upvotes: 0