Reputation: 91
This script is supposed to take a command line string, run it through the Google search engine, and, if results are found, open up the first 5 in different tabs. I am having some issues trying to get it to work. I think the problem is happening towards the bottom where it says links = soup.select(".r a")
. I have been altering the values here, and then the next line will show an actual length, but running it like this shows the length to still be 0. I am trying to scrape the .r class and a tag because that seems to be where the search results start in the Google result source code.
import requests
import bs4
import sys
import webbrowser
print("Googling...")
response = requests.get("https://www.google.com/#q=" + " ".join(sys.argv[1:]))
response.raise_for_status()
'''Function to return the top search result links'''
soup = bs4.BeautifulSoup(response.text, "html.parser")
'''Open a browser tab for each result'''
links = soup.select(".r a")
print(len(links))
numOpen = min(5, len(links))
for i in range(numOpen):
webbrowser.open("https://google.com/#q=" + links[i].get("href"))
Upvotes: 4
Views: 4412
Reputation: 1724
Instead of using min(5, len(links)), you can use slicing:
links = soup.select('.r a')[:5]
# or
for i in soup.select('.r a')[:5]:
    # other code..
Also, you can use the find_all() limit argument.
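For example, a rough find_all() equivalent of soup.select('.r a')[:5] (a sketch that assumes, as the original selector does, that the r class sits on the element wrapping each result link):
# find the first five elements with class "r" (limit=5 stops the search after five matches),
# then take the <a> tag inside each of them
links = [tag.find('a') for tag in soup.find_all(class_='r', limit=5)]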
Make sure you're using a user-agent, because the default requests user-agent is python-requests. Google blocks such a request because it knows it's a bot and not a "real" user visit, and you'll receive different HTML with some sort of an error. A user-agent fakes a real user visit by adding this information to the HTTP request headers.
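For example, a quick way to see what requests sends by default (a small illustrative check, not required for the fix):
import requests

# prints something like "python-requests/2.x.x", which is easy to flag as a bot
print(requests.utils.default_user_agent())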
I wrote a dedicated blog post about how to reduce the chance of being blocked while web scraping search engines that covers multiple solutions.
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    'User-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "samurai cop what does katana mean",
"gl": "us",
"hl": "en",
"num": "100"
}
html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.tF2Cxc')[:5]:
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
    print(title, link, sep='\n')
--------
'''
Samurai Cop - He speaks fluent Japanese - YouTube
https://www.youtube.com/watch?v=paTW3wOyIYw
Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647
"It means "Japanese sword"... 2 minute review of a ... - Reddit
https://www.reddit.com/r/NewTubers/comments/47hw1g/what_does_katana_mean_it_means_japanese_sword_2/
Samurai Cop (1991) - Mathew Karedas as Joe Marshall - IMDb
https://www.imdb.com/title/tt0130236/characters/nm0360481
What does Katana mean? - Samurai Cop quotes - Subzin.com
http://www.subzin.com/quotes/Samurai+Cop/What+does+Katana+mean%3F+-+It+means+Japanese+sword
'''
Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to figure out how to pick the correct selector or how to bypass blocks from search engines since it's already done for the end-user. All that really needs to be done is to iterate over structured JSON and get the data you want.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "samurai cop what does katana mean",
"hl": "en",
"gl": "us",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"][:5]:
    print(result['title'])
    print(result['link'])
---------
'''
Samurai Cop - He speaks fluent Japanese - YouTube
https://www.youtube.com/watch?v=paTW3wOyIYw
Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647
"It means "Japanese sword"... 2 minute review of a ... - Reddit
https://www.reddit.com/r/NewTubers/comments/47hw1g/what_does_katana_mean_it_means_japanese_sword_2/
Samurai Cop (1991) - Mathew Karedas as Joe Marshall - IMDb
https://www.imdb.com/title/tt0130236/characters/nm0360481
What does Katana mean? - Samurai Cop quotes - Subzin.com
http://www.subzin.com/quotes/Samurai+Cop/What+does+Katana+mean%3F+-+It+means+Japanese+sword
'''
Disclaimer, I work for SerpApi.
Upvotes: 0
Reputation: 381
You are right! The problem comes from select(".r a")
. I suggest you try find_all('a', {"data-uch": 1})
, which will find all a tags with the attribute data-uch = 1.
Explanation:
"If you look up a little from the element, though, there is an element like this: . Looking through the rest of the HTML source, it looks like the r class is used only for search result links."
The sentence above is from the book. However, in real, if you print this soup variable, soup = bs4.BeautifulSoup(response.text, "html.parser")
, you will not find any <h3 class="r">`` in the HTML source code. That is why
print(len(links))``` always show 0.
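A minimal sketch of that drop-in replacement for the links = soup.select(".r a") line in the question's script (the attribute value is written as the string "1" since HTML attribute values are parsed as strings, and this assumes Google's HTML still marks result links with data-uch):
# find every <a> tag that carries data-uch="1"
links = soup.find_all("a", attrs={"data-uch": "1"})
print(len(links))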
Upvotes: 0
Reputation: 1063
Your logic is right, except the URL for the Google search is not right. It should be:
response = requests.get("https://www.google.com/search?q=" + " ".join(sys.argv[1:]))
...
for i in range(numOpen):
webbrowser.open("https://www.google.com" + links[i].get("href"))
Here is the full code:
import requests
import bs4
import sys
import webbrowser
print("Googling...")
response = requests.get("https://www.google.com/search?q=" + " ".join(sys.argv[1:]))
response.raise_for_status()
'''Function to return the top search result links'''
soup = bs4.BeautifulSoup(response.text, "html.parser")
'''Open a browser tab for each result'''
links = soup.select(".r a")
print(len(links))
numOpen = min(5, len(links))
for i in range(numOpen):
webbrowser.open("https://www.google.com" + links[i].get("href"))
Upvotes: 2