Daniel

Reputation: 113

How to print the number of Google search results (BeautifulSoup)

This is what I've done so far:

import requests
from bs4 import BeautifulSoup

URL = "https://www.google.com/search?q=programming"
r = requests.get(URL) 

soup = BeautifulSoup(r.content, 'html5lib')

table = soup.find('div', attrs = {'id':'result-stats'}) 

print(table)

I want to get the number of results as an integer, for example 1350000000.

Upvotes: 3

Views: 2218

Answers (3)

Dmitriy Zub

Reputation: 1724

If you need to extract just one element, use the select_one() bs4 method, which takes a CSS selector. I find it a bit more readable than find(). See the CSS selectors reference.

If you need to extract data very fast, try selectolax, a wrapper around the lexbor HTML rendering library, which is written in pure C with no dependencies.

Code and example in the online IDE:

import requests  # the 'lxml' parser used below must be installed: pip install lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "fus ro dah definition",  # query
  "gl": "us",                    # country 
  "hl": "en"                     # language
}

response = requests.get('https://www.google.com/search',
                        headers=headers,
                        params=params)
soup = BeautifulSoup(response.text, 'lxml')

# .previous_sibling is the text node right before <nobr>, which leaves out the unwanted part: "(0.38 seconds)"
number_of_results = soup.select_one('#result-stats nobr').previous_sibling
print(number_of_results)

# About 107,000 results
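
To see what the selector and .previous_sibling do without hitting Google, here is a minimal offline sketch. The SAMPLE_HTML markup and the extract_result_stats helper are my own assumptions, modeled on the structure the selector above expects; the real page markup may differ or change.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption), shaped like the element the selector targets.
SAMPLE_HTML = (
    '<div id="result-stats">About 107,000 results'
    '<nobr> (0.38 seconds)&nbsp;</nobr></div>'
)

def extract_result_stats(html):
    """Return the text just before the <nobr> timing element, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    nobr = soup.select_one("#result-stats nobr")
    # select_one() returns None when nothing matches, e.g. on a block page.
    if nobr is None or nobr.previous_sibling is None:
        return None
    return str(nobr.previous_sibling).strip()

print(extract_result_stats(SAMPLE_HTML))
```

Checking for None first avoids an AttributeError when the element is missing from the response.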

Alternatively, you can achieve the same thing with the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you only need to read the value you want from structured JSON, rather than figuring out how to extract certain elements or how to bypass blocks from Google.

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "fus ro dah definition",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

result = results["search_information"]['total_results']
print(result)

# 107000
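
If a key is missing from the returned dict, the direct lookup above raises KeyError. A small defensive sketch, assuming only the dict shape used in the code above (the total_results helper and the sample dict are hypothetical):

```python
def total_results(results):
    """Return total_results as an int, or None when the key is absent."""
    info = results.get("search_information") or {}
    value = info.get("total_results")
    return int(value) if value is not None else None

# Sample dict (an assumption) with the same shape the lookup above relies on.
sample = {"search_information": {"total_results": 107000}}
print(total_results(sample))
print(total_results({}))
```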

P.S. I wrote a blog post about how to scrape Google Organic Results.

Disclaimer: I work for SerpApi.

Upvotes: 0

HSB

Reputation: 70

Sending a browser-like User-Agent header will do the trick. This prints the result-stats element:

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"}
result = requests.get("https://www.google.com/search?q=programming", headers=headers)

src = result.content
soup = BeautifulSoup(src, 'lxml')

print(soup.find("div", {"id": "result-stats"}))

Upvotes: 0

Ahmed Soliman

Reputation: 1710

You are missing the User-Agent header, which is a string that tells the server what kind of client you are accessing the page with.

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"}
URL     = "https://www.google.com/search?q=programming"
result = requests.get(URL, headers=headers)    

soup = BeautifulSoup(result.content, 'html.parser')

stats_div = soup.find("div", {"id": "result-stats"})
# Direct text of the div only, e.g. 'About 1,410,000,000 results' (nested elements are skipped)
total_results_text = stats_div.find(string=True, recursive=False)
# Drop every character that is not a digit, then convert to an integer
results_num = int(''.join(ch for ch in total_results_text if ch.isdigit()))
print(results_num)
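
If you end up with the full stats text, including the "(0.58 seconds)" part, joining every digit would also pick up the timing digits. Here is a small sketch that takes only the first number and returns an integer. It assumes English-style comma grouping, and parse_result_count is a hypothetical helper name:

```python
import re

def parse_result_count(stats_text):
    """Return the first comma-grouped number in the text as an int, or None."""
    # Assumes English-style grouping such as '1,410,000,000'.
    match = re.search(r"\d[\d,]*", stats_text)
    if match is None:
        return None
    return int(match.group(0).replace(",", ""))

print(parse_result_count("About 1,410,000,000 results (0.58 seconds)"))
```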

Upvotes: 4
