Zack Plauché

Reputation: 4240

soup.select('.r a') in f'https://google.com/search?q={query}' brings back empty list in Python BeautifulSoup. **NOT A DUPLICATE**

The Situation:

The "I'm Feeling Lucky!" project in the "Automate the Boring Stuff with Python" ebook no longer works with the code the author provided.

Specifically:

linkElems = soup.select('.r a')

What I have done: I've already tried the solution provided in this Stack Overflow question.

I'm also currently using the same search format.

Code:

    import webbrowser, requests, bs4

    def im_feeling_lucky():
    
        # Make search query look like Google's
        search = '+'.join(input('Search Google: ').split(" "))
  
        # Pull html from Google
        print('Googling...') # display text while downloading the Google page
        res = requests.get(f'https://google.com/search?q={search}&oq={search}')
        res.raise_for_status()

        # Retrieve top search result link
        soup = bs4.BeautifulSoup(res.text, features='lxml')


        # Open a browser tab for each result.
        linkElems = soup.select('.r')  # Returns empty list
        numOpen = min(5, len(linkElems))
        print('Before for loop')
        for i in range(numOpen):
            webbrowser.open(f'http://google.com{linkElems[i].get("href")}')

The Problem:

The linkElems variable returns an empty list [] and the program doesn't do anything past that.

The Question:

Could somebody please guide me to the correct way of handling this, and perhaps explain why it isn't working?

Upvotes: 1

Views: 2222

Answers (4)

Dmitriy Zub

Reputation: 1724

There's actually no need to save the HTML file. One of the reasons the response output is different from what you see in the browser is that no headers are sent with the request, in particular a user-agent header, which makes the request look like a visit from a real browser (as Cucurucho already noted).

When no user-agent is specified, the requests library defaults to python-requests. Google detects this, blocks the request, and serves different HTML with different CSS selectors. Check what your user-agent is.
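As a quick sanity check of the claim above, you can print the default User-Agent that requests sends when you don't supply one yourself (a minimal sketch; it only needs the requests library installed):

```python
import requests

# With no headers argument, requests identifies itself as
# "python-requests/<version>", which servers can easily detect.
default_ua = requests.utils.default_headers()['User-Agent']
print(default_ua)
```

This is the string Google sees when the code from the question runs, which is why the served HTML differs from what your browser gets.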

Pass user-agent:

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

requests.get('URL', headers=headers)

To grab CSS selectors more easily, have a look at the SelectorGadget extension, which shows an element's CSS selector when you click on it in your browser.


Code and example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

params = {
  'q': 'how to create minecraft server',
  'gl': 'us',
  'hl': 'en',
}

html = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')

# [:5] - first 5 results
# container with needed data: title, link, snippet, etc.
for result in soup.select('.tF2Cxc')[:5]:
  link = result.select_one('.yuRUbf a')['href']
  print(link)

----------
'''
https://help.minecraft.net/hc/en-us/articles/360058525452-How-to-Setup-a-Minecraft-Java-Edition-Server
https://www.minecraft.net/en-us/download/server
https://www.idtech.com/blog/creating-minecraft-server
https://minecraft.fandom.com/wiki/Tutorials/Setting_up_a_server
https://codewizardshq.com/how-to-make-a-minecraft-server/
'''

Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you don't have to spend time figuring out how to bypass blocks from Google or which CSS selector is right for parsing the data. Instead, you pass the parameters (params) you want and iterate over structured JSON to get the data.

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
  "engine": "google",
  "q": "how to create minecraft server",
  "hl": "en",
  "gl": "us",
  "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"][:5]:
  print(result["link"])


----------
'''
https://help.minecraft.net/hc/en-us/articles/360058525452-How-to-Setup-a-Minecraft-Java-Edition-Server
https://www.minecraft.net/en-us/download/server
https://www.idtech.com/blog/creating-minecraft-server
https://minecraft.fandom.com/wiki/Tutorials/Setting_up_a_server
https://codewizardshq.com/how-to-make-a-minecraft-server/
'''

Disclaimer, I work for SerpApi.

Upvotes: 0

Cucurucho

Reputation: 71

Different websites (Google, for instance) serve different HTML to different User-Agents (the string by which a website identifies your web browser). Another solution to your problem is to send a browser User-Agent, so that the HTML you obtain from the website is the same you would get by using "view page source" in your browser. The following code just prints the list of Google search result URLs. It isn't exactly the program from the book you've referenced, but it still illustrates the point.

#! python3
# lucky.py - Opens several Google search results.

import requests, sys, webbrowser, bs4
print('Please enter your search term:')
searchTerm = input()
print('Googling...')    # display text while downloading the Google page

url = 'http://google.com/search?q=' + '+'.join(searchTerm.split())
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

res = requests.get(url, headers=headers)
res.raise_for_status()


# Retrieve top search results links.
soup = bs4.BeautifulSoup(res.content, 'html.parser')

# Open a browser tab for each result.
linkElems = soup.select('.r > a')   # Used '.r > a' instead of '.r a' because
numOpen = min(5, len(linkElems))    # there are many href after div class="r"
for i in range(numOpen):
  # webbrowser.open('http://google.com' + linkElems[i].get('href'))
  print(linkElems[i].get('href'))

Upvotes: 2

EngieViral

Reputation: 41

I took a different route. I saved the HTML from the request, opened that page, and inspected the elements. It turns out that the page Google serves to my Python request is different from the one I get natively in the Chrome browser. I identified the div class that appears to denote a result and substituted it for .r - in my case it was .kCrYT

#! python3

# lucky.py - Opens several Google Search results.

import requests, sys, webbrowser, bs4

print('Googling...') # display text while the google page is downloading

url= 'http://www.google.com.au/search?q=' + ' '.join(sys.argv[1:])
url = url.replace(' ','+')


res = requests.get(url)
res.raise_for_status()


# Retrieve top search result links.
soup=bs4.BeautifulSoup(res.text, 'html.parser')


# get all of the 'a' tags after an element with the class 'kCrYT' (which are the results)
linkElems = soup.select('.kCrYT > a') 

# Open a browser tab for each result.
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    webbrowser.open_new_tab('http://google.com.au' + linkElems[i].get('href'))
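The save-and-inspect workflow described above can be sketched without any network access. The HTML below is a simplified, hypothetical stand-in for a saved response (only the .kCrYT class name comes from this answer; real Google markup is far more complex), and Python's standard-library html.parser stands in for BeautifulSoup:

```python
from html.parser import HTMLParser

# Hypothetical stand-in for HTML saved from a response, e.g. via
# open('response.html', 'w').write(res.text), then inspected in a browser.
SAVED_HTML = """
<div class="kCrYT"><a href="/url?q=https://example.com/first">First result</a></div>
<div class="kCrYT"><a href="/url?q=https://example.com/second">Second result</a></div>
"""

class ResultLinkFinder(HTMLParser):
    """Collect hrefs of <a> tags nested inside <div class="kCrYT">."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # how many result divs we are currently inside
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div' and 'kCrYT' in attrs.get('class', ''):
            self.depth += 1
        elif tag == 'a' and self.depth:
            self.links.append(attrs.get('href'))

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1

finder = ResultLinkFinder()
finder.feed(SAVED_HTML)
print(finder.links)
```

Once the container class is confirmed this way, `soup.select('.kCrYT > a')` in the answer's code extracts the same links from the live page.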

Upvotes: 4

Aravind Emmadishetty

Reputation: 521

I too had the same problem while reading that book, and found a solution for it.

replacing

soup.select('.r a')

with

soup.select('div#main > div > div > div > a')

will solve that issue

Following is the code that works:

import webbrowser, requests, bs4 , sys

print('Googling...')
res = requests.get('https://google.com/search?q=' + ' '.join(sys.argv[1:]))
res.raise_for_status()

soup = bs4.BeautifulSoup(res.text, 'html.parser')

linkElems = soup.select('div#main > div > div > div > a')  
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    webbrowser.open('http://google.com' + linkElems[i].get("href"))

The above code takes its input from command-line arguments.

Upvotes: 5
