Reputation: 93
So I've been working through Al Sweigart's online Automate The Boring Stuff With Python tutorials, and I've just got to the web scraping part. Here's my code with a description of what the program is supposed to do:
#! python3
# lucky.py - A small program that allows you to get search keywords from
# command line arguments, retrieve the search results page, and open
# a new browser tab for each result
# Steps:
# 1. Read the command line arguments from sys.argv
# 2. Fetch the search result page with the requests module
# 3. Find the links to each search result
# 4. Call the webbrowser.open() function to open the web browser
import sys, requests, bs4, webbrowser
# 1. Read the command line arguments from sys.argv
print('Googling...')
if len(sys.argv) > 1:
    search = ' '.join(sys.argv[1:])
url = "https://www.google.com/#q="
for i in range(len(search.split())):
    url += search.split()[i] + "+"
# 2. Fetch the search result page with the requests module
page = requests.get(url)
# 3. Find the links to each search result
soup = bs4.BeautifulSoup(page.text, 'lxml')
linkElems = soup.select('.r a')
# 4. Call the webbrowser.open() function to open the web browser
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    webbrowser.open("http://google.com" + linkElems[i].get('href'))
So the issue here is that when I check the length of linkElems, it's 0, meaning that the soup.select('.r a') command failed to aggregate the content defined under element <a> inside class=r (a class only used for search results in Google as can be seen when using the developer tools). As a result, no web pages of the search results open up in my browser.
I think the issue has something to do either with the HTML-parser not working correctly, or Google changing the way their HTML code works(?). Any insight into this issue would be greatly appreciated!
Upvotes: 0
Views: 709
Reputation: 1724
There's no need for phantomjs or selenium. Also, the query parameter is wrong: #q= should be ?q=.
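To illustrate why #q= fails: everything after # is a URL fragment, which is handled by the browser and is not sent to the server, so Google only sees a request for its home page. Here is a minimal sketch using only the standard library. The helper name build_search_url is made up for illustration, and the /search path is taken from the full example further down.

```python
from urllib.parse import urlencode, urlsplit

def build_search_url(words):
    """Build a Google search URL from a list of keywords (hypothetical helper)."""
    # urlencode escapes the query and turns spaces into '+'
    return "https://www.google.com/search?" + urlencode({"q": " ".join(words)})

# The form used in the question: the keywords sit in the fragment,
# so the query string that reaches the server is empty.
broken = urlsplit("https://www.google.com/#q=python+web+scraping")
print(repr(broken.query), repr(broken.fragment))

# The corrected form: the keywords are a real query parameter.
fixed = build_search_url(["python", "web", "scraping"])
print(fixed)
```

In the original script, sys.argv[1:] could be passed straight to such a helper instead of concatenating "+" by hand, which also avoids the trailing "+".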
You can limit the selected links by using list slicing:
linkElems = soup.select('.r a')[:5]
# or
for i in soup.select('.r a')[:5]:
    # other code..
Make sure you're passing a user-agent, because the default requests user-agent is python-requests. Google blocks such a request because it knows that it's a bot and not a "real" user visit, and you'll receive different HTML with some sort of error. Setting a user-agent fakes a user visit by adding this information to the HTTP request headers.
I wrote a dedicated blog post about how to reduce the chance of being blocked while web scraping search engines; it covers multiple solutions.
Pass the user-agent in the request headers:
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get('YOUR_URL', headers=headers)
Code:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "samurai cop what does katana mean",
    "gl": "us",
    "hl": "en"
}
html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.tF2Cxc')[:5]:
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
    print(title, link, sep='\n')
--------
'''
Samurai Cop - He speaks fluent Japanese - YouTube
https://www.youtube.com/watch?v=paTW3wOyIYw
Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647
"It means "Japanese sword"... 2 minute review of a ... - Reddit
https://www.reddit.com/r/NewTubers/comments/47hw1g/what_does_katana_mean_it_means_japanese_sword_2/
Samurai Cop (1991) - Mathew Karedas as Joe Marshall - IMDb
https://www.imdb.com/title/tt0130236/characters/nm0360481
What does Katana mean? - Samurai Cop quotes - Subzin.com
http://www.subzin.com/quotes/Samurai+Cop/What+does+Katana+mean%3F+-+It+means+Japanese+sword
'''
Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to iterate over structured JSON and get the data you want, rather than building everything from scratch and maintaining it over time.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "samurai cop what does katana mean",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"][:5]:
    print(result['title'])
    print(result['link'])
---------
'''
Samurai Cop - He speaks fluent Japanese - YouTube
https://www.youtube.com/watch?v=paTW3wOyIYw
Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647
"It means "Japanese sword"... 2 minute review of a ... - Reddit
https://www.reddit.com/r/NewTubers/comments/47hw1g/what_does_katana_mean_it_means_japanese_sword_2/
Samurai Cop (1991) - Mathew Karedas as Joe Marshall - IMDb
https://www.imdb.com/title/tt0130236/characters/nm0360481
What does Katana mean? - Samurai Cop quotes - Subzin.com
http://www.subzin.com/quotes/Samurai+Cop/What+does+Katana+mean%3F+-+It+means+Japanese+sword
'''
Disclaimer: I work for SerpApi.
Upvotes: 0
Reputation: 432
Google seems to be detecting that you're a bot and not a real web browser with cookies and JavaScript. What they seem to be doing with the new results is still getting web scrapers to follow the links they provide, prefixed with https://www.google.com, so that when you then go to that URL, they can still track your movement.
You could also try to find a pattern in the link provided. For instance, when you search for 'linux', it returns the following:
/url?q=https://en.wikipedia.org/wiki/Linux&sa=U&ved=9775308e-206b-11e8-b45f-fb72cae612a8&usg=9775308e-206b-11e8-b45f-fb72cae612a8
/url?q=https://www.linux.org/&sa=U&ved=9775308e-206b-11e8-b45f-fb72cae612a8&usg=9775308e-206b-11e8-b45f-fb72cae612a8
/url?q=https://www.linux.com/what-is-linux&sa=U&ved=9775308e-206b-11e8-b45f-fb72cae612a8&usg=d50ea51a-206b-11e8-9432-2bee635f8337
/url?q=https://www.ubuntu.com/&sa=U&ved=9775308e-206b-11e8-b45f-fb72cae612a8&usg=dab9f6a4-206b-11e8-a999-3fc9d4576425
/search?q=linux&ie=UTF-8&prmd=ivns&source=univ&tbm=nws&tbo=u&sa=X&ved=9775308e-206b-11e8-b45f-fb72cae612a8
You could use a regex to grab the part between '/url?q=' and '&sa=U&ved=', as that's the URL that you probably want. Of course, that doesn't work with the 5th result that it returned, because it's something special for the Google website. Again, tacking https://www.google.com onto the front of each URL returned is probably the safest thing to do.
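As a rough sketch of that idea, here is one way to pull out the target URL and fall back to prefixing https://www.google.com for everything else. Note that it swaps the regex for urllib.parse from the standard library, which reads the q parameter directly and also decodes percent-escapes. The helper name resolve_result_link is hypothetical, the ved and usg values in the sample calls are shortened placeholders, and the /url?q= pattern is assumed from the sample links above; Google may change it at any time.

```python
from urllib.parse import parse_qs, urljoin, urlsplit

def resolve_result_link(href):
    """Return the target URL for a Google result href (hypothetical helper)."""
    parts = urlsplit(href)
    if parts.path == "/url":
        # Assumption: the real destination is carried in the 'q' parameter
        target = parse_qs(parts.query).get("q")
        if target:
            return target[0]
    # Anything else (e.g. "/search?...") is relative to Google itself
    return urljoin("https://www.google.com", href)

print(resolve_result_link("/url?q=https://www.linux.org/&sa=U&ved=abc&usg=def"))
print(resolve_result_link("/search?q=linux&tbm=nws"))
```

The first call yields the site's own address, while the second stays a Google URL, which matches the "5th result" case described above.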
Most search engines (even duckduckgo.com) are trying to track search results and clicks. If you try to avoid it, they have detection code in place to block you. You may have run into this with Google telling you they've detected a large number of searches from your IP and that you have to go through a CAPTCHA test to continue.
Upvotes: 1
Reputation: 8636
linkElems = soup.find_all('a',href=True)
This returns all the relevant <a> tags, and you can process the list to decide what to keep and what not to keep.
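One possible way to do that processing, sketched on plain href strings so it runs without any network access. In the real script the list would come from something like [a['href'] for a in linkElems]. The helper name keep_result_links is made up, and the "/url?q=" prefix is an assumption based on the sample links shown in the other answer, not a documented format.

```python
def keep_result_links(hrefs, limit=5):
    """Keep the first `limit` unique hrefs that look like search results."""
    kept = []
    for href in hrefs:
        if len(kept) >= limit:
            break
        # Assumption: result links start with "/url?q="; skip duplicates
        if href.startswith("/url?q=") and href not in kept:
            kept.append(href)
    return kept

sample = [
    "/search?q=linux&tbm=nws",              # Google's own navigation link
    "/url?q=https://www.linux.org/&sa=U",
    "/url?q=https://www.linux.org/&sa=U",   # duplicate of the previous one
    "/url?q=https://www.ubuntu.com/&sa=U",
    "#",
]
print(keep_result_links(sample))
```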
Upvotes: 0