Reputation: 113
This is what I've done so far:
import requests
from bs4 import BeautifulSoup
URL = "https://www.google.com/search?q=programming"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
table = soup.find('div', attrs = {'id':'result-stats'})
print(table)
I want to extract the number of results as an integer, e.g. 1350000000.
Upvotes: 3
Views: 2218
Reputation: 1724
If you need to extract just one element, use the bs4 select_one() method. It's a bit more readable and a bit faster than find(). See the CSS selectors reference.
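For example, select_one() takes a CSS selector and returns the first match, equivalent to the find() call from the question but more concise. A minimal sketch using an inline HTML snippet as a stand-in for a real Google response:

```python
from bs4 import BeautifulSoup

html = '<div id="result-stats">About 1,350,000,000 results<nobr> (0.61 seconds)</nobr></div>'
soup = BeautifulSoup(html, 'html.parser')

# Both calls return the same element; select_one() takes a CSS selector
via_find = soup.find('div', attrs={'id': 'result-stats'})
via_select = soup.select_one('#result-stats')

print(via_select.text == via_find.text)  # True
```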
If you need to extract data very fast, try selectolax, a wrapper around the lexbor HTML renderer library, which is written in pure C with no dependencies and is very fast.
Code and example in the online IDE:
import requests, lxml
from bs4 import BeautifulSoup
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "fus ro dah definition", # query
"gl": "us", # country
"hl": "en" # language
}
response = requests.get('https://www.google.com/search',
headers=headers,
params=params)
soup = BeautifulSoup(response.text, 'lxml')
# .previous_sibling moves to the previous sibling node, dropping the unwanted "(0.38 seconds)" part
number_of_results = soup.select_one('#result-stats nobr').previous_sibling
print(number_of_results)
# About 107,000 results
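Since the question asks for an integer, the returned string still needs cleaning. A minimal sketch, using a hypothetical sample of what .previous_sibling returns:

```python
# Hypothetical sample of the extracted stats text
number_of_results = "About 107,000 results"

# Keep only the digits, then convert to int
total = int(''.join(ch for ch in number_of_results if ch.isdigit()))
print(total)  # 107000
```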
Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to pick the data you want out of the structured JSON, rather than figuring out how to extract certain elements or how to bypass blocks from Google.
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "fus ro dah defenition",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
result = results["search_information"]["total_results"]
print(result)
# 107000
P.S. I wrote a blog post about how to scrape Google Organic Results.
Disclaimer, I work for SerpApi.
Upvotes: 0
Reputation: 70
This code will do the trick:
import requests
from bs4 import BeautifulSoup
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"}
result = requests.get("https://www.google.com/search?q=programming", headers=headers)
src = result.content
soup = BeautifulSoup(src, 'lxml')
print(soup.find("div", {"id": "result-stats"}))
Upvotes: 0
Reputation: 1710
You are missing the User-Agent header, which is a string that tells the server what kind of device you are accessing the page with.
import requests
from bs4 import BeautifulSoup
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"}
URL = "https://www.google.com/search?q=programming"
result = requests.get(URL, headers=headers)
soup = BeautifulSoup(result.content, 'html.parser')
total_results_text = soup.find("div", {"id": "result-stats"}).find(text=True, recursive=False) # outer text only, e.g. 'About 1,410,000,000 results'
results_num = ''.join([num for num in total_results_text if num.isdigit()]) # strip everything that is not a digit
print(results_num)
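Note that results_num is still a string at this point; since the question asks for an integer, one more int() call finishes the job. A sketch with a hypothetical sample string:

```python
# Hypothetical sample of the outer text from the result-stats div
total_results_text = "About 1,410,000,000 results"

# Strip everything that isn't a digit, then convert to int
results_num = ''.join(num for num in total_results_text if num.isdigit())
print(int(results_num))  # 1410000000
```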
Upvotes: 4