user5841014

Reputation: 61

Opening top 5 results on Google

I am trying to teach myself web scraping, so I have been going through chapter 11 of the book "Automate the Boring Stuff," which can be seen below:

https://automatetheboringstuff.com/chapter11/

One part which has gotten me stuck is an exercise where you open the top five search results from a Google search of whatever term is on your clipboard.

When I run the code, even copied and pasted straight from the book, it does not seem to store any results like the book says it should. I have tried tracking down the issue, and I think the problem is that when I save the search page as a variable, nothing gets saved, so when the script then tries to open the first five results there aren't any to open. Below is my code; the only change I made from the book was adding 'lxml' to the BeautifulSoup calls.

#! python3
# lucky.py - Opens several Google search results.

import requests, sys, webbrowser, bs4

print('Googling...') # display text while downloading the Google page
res = requests.get('http://google.com/search?q=' + ' '.join(sys.argv[1:]), 'lxml')
res.raise_for_status()

# Retrieve top search result links.
soup = bs4.BeautifulSoup(res.text, 'lxml')

# Open a browser tab for each result.
linkElems = soup.select('.r a')
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    webbrowser.open('http://google.com' + linkElems[i].get('href'))

Thanks

Upvotes: 2

Views: 2233

Answers (4)

Dmitriy Zub

Reputation: 1724

If you're using find_all(), you can simply pass the limit argument; if you're using select(), slicing is needed. (See a CSS selectors reference, or use the SelectorGadget extension to grab CSS selectors by clicking on the desired element in your browser.)

# grab the first five results (indices 0 through 4)
linkElems = soup.select('.r a')[:5]
# or cap the number of matches with find_all(), which takes a tag name
# and class rather than a CSS selector
linkElems = soup.find_all('div', class_='r', limit=5)

Another problem is that no user-agent is specified. The default requests user-agent is python-requests, so Google can tell the request comes from a script rather than a "real" user visit and serves different HTML, usually with some sort of error. Passing a User-agent in the HTTP request headers makes the request look like an ordinary browser visit.

I wrote a dedicated blog post about how to reduce the chance of being blocked while web scraping search engines; it covers multiple solutions.

Pass a user-agent in the request headers:

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
requests.get('YOUR_URL', headers=headers)

Code and example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "samurai cop what does katana mean",
  "gl": "us",
  "hl": "en",
}

html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc')[:5]:
  link = result.select_one('.yuRUbf a')['href']
  print(link)

--------
'''
https://www.youtube.com/watch?v=paTW3wOyIYw
https://www.quotes.net/mquote/1060647
https://www.reddit.com/r/NewTubers/comments/47hw1g/what_does_katana_mean_it_means_japanese_sword_2/
https://www.imdb.com/title/tt0130236/characters/nm0360481
http://www.subzin.com/quotes/Samurai+Cop/What+does+Katana+mean%3F+-+It+means+Japanese+sword
'''

Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you don't have to figure out the correct selectors or why certain things don't work as expected. Instead, you only need to iterate over structured JSON and grab the data you want.

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
  "engine": "google",
  "q": "samurai cop what does katana mean",
  "hl": "en",
  "gl": "us",
  "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"][:5]:
  print(result['link'])

---------
'''
https://www.youtube.com/watch?v=paTW3wOyIYw
https://www.quotes.net/mquote/1060647
https://www.reddit.com/r/NewTubers/comments/47hw1g/what_does_katana_mean_it_means_japanese_sword_2/
https://www.imdb.com/title/tt0130236/characters/nm0360481
http://www.subzin.com/quotes/Samurai+Cop/What+does+Katana+mean%3F+-+It+means+Japanese+sword
'''

Disclaimer, I work for SerpApi.

Upvotes: 0

Joe R

Reputation: 21

Depending on the version of the book, you may be searching the pypi.org website instead of Google. When you inspect the search results on pypi.org, you can see that the element you want has the class "package-snippet", so you can use its href to complete the URL: urlToOpen = 'http://pypi.org' + linkElems[i].get('href')

This is what the element looks like: <a class="package-snippet" href="/project/boring/">. In this case the search term passed in the arguments was "boring stuff", and "boring" (a small HTTP web server for WSGI-compatible apps) happens to be the first result. The href is "/project/boring/", which we later join to the base URL.

#!/usr/bin/env python3
# searchpypi.py - Opens several search results.

import requests, sys, webbrowser, bs4
print('Searching...')  #Display text while downloading search result page.  
res = requests.get('https://pypi.org/search/?q=' + ' '.join(sys.argv[1:])) 
print(sys.argv[1:])
res.raise_for_status()

# Retrieve top search result links. 
soup = bs4.BeautifulSoup(res.text, 'html.parser')
# Open a browser tab for each result. The element looks like this when
# searching for "boring stuff": <a class="package-snippet" href="/project/boring/">
linkElems = soup.select('.package-snippet')
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    # join the href to the base url 
    urlToOpen = 'http://pypi.org' + linkElems[i].get('href')
    print('Opening', urlToOpen)
    webbrowser.open(urlToOpen)

To run this, you need a terminal open in the folder where searchpypi.py was saved, then enter the command: python3 searchpypi.py boring stuff

It uses the arguments after the file name, in this case "boring stuff", as the search term. This can be changed to whatever you want to search pypi.org for.
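To make that concrete, here is a small sketch (not part of the book's script) of what happens to those command-line words before the request is sent:

import sys

# Invoked as: python3 searchpypi.py boring stuff
# sys.argv     -> ['searchpypi.py', 'boring', 'stuff']
# sys.argv[1:] -> ['boring', 'stuff']
search_term = ' '.join(sys.argv[1:])               # -> 'boring stuff'
url = 'https://pypi.org/search/?q=' + search_term
print(url)  # requests percent-encodes the space when the real script sends the request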

Upvotes: 2

jlbnjmn

Reputation: 958

Are you running the code from the command line and passing it arguments?

Without search terms, the request goes to http://google.com/search?q=, which redirects to Google's home page, where no HTML elements match the selector.

The code from the book worked when run from the command line with arguments. The only change I made was switching to the html5lib parser.

To see if the lack of command line arguments is the issue, try the following code:

import requests, sys, webbrowser, bs4

search_term = 'python'

print('Googling...') # display text while downloading the Google page
res = requests.get('http://google.com/search?q={0}'.format(search_term))
res.raise_for_status()

# Retrieve top search result links.
soup = bs4.BeautifulSoup(res.text, 'html5lib')

# Open a browser tab for each result.
linkElems = soup.select('.r a')
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    webbrowser.open('http://google.com' + linkElems[i].get('href'))

If that works, then try running your original code with search terms, like this:

$ python your_code.py things to search for

Upvotes: 1

user3672754

Reputation:

I found what you are doing interesting and wanted to help. I didn't look at your book, but here's a working script that you should be able to adapt.

#!/usr/bin/python3

import requests
import sys
import webbrowser

word_to_search = 'test'

request = requests.get('http://google.com/search?q=' + word_to_search)
content = request.content.decode('UTF-8', 'replace')

#
# Parse the content and get the links. I had a problem with
# bs4 so I manually searched over the content.
#
links = []
while '<h3 class="r">' in content:
    # Drop everything up to and including the next result heading.
    content = content.split('<h3 class="r">', 1)[1]
    split_content = content.split('</h3>', 1)
    # In the text after the heading, take what follows ':http' up to the
    # first '%' and prepend 'http' to rebuild the result URL.
    link = 'http' + split_content[1].split(':http', 1)[1].split('%', 1)[0]
    links.append(link)
    content = split_content[1]

for link in links[:5]:  # open at most 5 links
    webbrowser.open(link)

Maybe your script is not working because there's no '.r a' element in the HTML that Google returns.
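A quick way to test that theory is to check whether the selector matches anything at all in the HTML you actually receive. This is just a minimal diagnostic sketch reusing the selector from the question, not part of the original script:

import requests, bs4

res = requests.get('http://google.com/search?q=test')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'lxml')

# If these print 0 and False, the HTML Google serves to a script simply
# doesn't contain the .r class, so '.r a' can never match anything.
print(len(soup.select('.r a')))
print('class="r"' in res.text)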

Upvotes: 2
