Reputation: 15
I want to scrape all the distances in the given Google result image. I was able to scrape the first distance, but I am not able to scrape the 2nd and 3rd distances. I am using the code below to scrape the first distance.
from urllib.parse import quote

import requests
from bs4 import BeautifulSoup

qstr = quote("distance between zip codes 75000 paris and 75016 paris")
url_getallfolders = 'https://www.google.com/search?q=' + qstr
response = requests.get(url_getallfolders)
soup = BeautifulSoup(response.content, 'lxml')
# print(response.text)
tagc = soup.select("div.kCrYT span")
codes = [i.text.strip() for i in tagc]
print(codes)
Upvotes: 0
Views: 246
Reputation: 1734
| Search query | Result |
|---|---|
| distance between zip codes 75000 paris and 75016 paris | zero relevance results |
| distance between zip 75000 paris and zip 75016 paris | desired results |
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "distance between zip 75000 paris and zip 75016 paris",
    "hl": "en",
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
# https://www.whatismybrowser.com/detect/what-is-my-user-agent
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4758.87 Safari/537.36",
}

html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

for result in soup.select(".uE1RRc"):
    print(result.text)
"""
38 min (15.8 km) via Bd Périphérique
38 min (11.1 km) via Av. de New York
44 min (12.4 km) via Bd Haussmann and Bd Périphérique
"""
Alternatively, you can achieve the same result with the Google Answer Box API from SerpApi. It's a paid API with a free plan.
The main difference is that you don't have to figure out how to parse the data, bypass blocks from Google, or maintain the parser over time.
Example code to integrate:
from serpapi import GoogleSearch
import os, json
# https://docs.python.org/3/library/os.html#os.getenv
params = {
    "api_key": os.getenv("API_KEY"),  # your SerpApi API key
    "engine": "google",               # search engine
    "q": "what distance between zip 75000 paris and zip 75016 paris",  # query
    "hl": "en"                        # language
    # other search parameters
}
search = GoogleSearch(params) # where data extraction happens
results = search.get_dict() # JSON -> Python dictionary
routes = results["answer_box"]["routes"]
print(json.dumps(routes, indent=2, ensure_ascii=False))
Output from three routes:
[
  {
    "summary": "48 min (11.1 km) via Av. de New York",
    "formatted": {
      "duration": "48 min",
      "distance": "11.1 km",
      "via": "Av. de New York"
    },
    "link": "https://www.google.com/maps/dir/75000+Paris,+France/75016+Paris,+France/data=!4m8!4m7!1m2!1m1!1s0x47e66e74623cb693:0x10389ef77ae91296!1m2!1m1!1s0x47e67ab45134ecd9:0x1c0b82c6e1d851f0!3e0?sa=X&hl=en"
  },
  {
    "summary": "50 min (15.8 km) via Bd Périphérique",
    "formatted": {
      "duration": "50 min",
      "distance": "15.8 km",
      "via": "Bd Périphérique"
    },
    "link": "https://www.google.com/maps/dir/75000+Paris,+France/75016+Paris,+France/data=!4m8!4m7!1m2!1m1!1s0x47e66e74623cb693:0x10389ef77ae91296!1m2!1m1!1s0x47e67ab45134ecd9:0x1c0b82c6e1d851f0!3e0?sa=X&hl=en"
  },
  {
    "summary": "52 min (12.4 km) via Bd Haussmann and Bd Périphérique",
    "formatted": {
      "duration": "52 min",
      "distance": "12.4 km",
      "via": "Bd Haussmann and Bd Périphérique"
    },
    "link": "https://www.google.com/maps/dir/75000+Paris,+France/75016+Paris,+France/data=!4m8!4m7!1m2!1m1!1s0x47e66e74623cb693:0x10389ef77ae91296!1m2!1m1!1s0x47e67ab45134ecd9:0x1c0b82c6e1d851f0!3e0?sa=X&hl=en"
  }
]
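If you only need the distances themselves, they can be picked out of the routes list with a list comprehension (a sketch based on the structure shown in the output above):

# Sketch: collect just the distance strings from the routes shown above.
distances = [route["formatted"]["distance"] for route in routes]
print(distances)  # ['11.1 km', '15.8 km', '12.4 km']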
Disclaimer: I work for SerpApi.
Upvotes: 1
Reputation: 195613
You can use a regex pattern in soup.find() to find the distance (also set the User-Agent HTTP header).
For example:
import re
import requests
from bs4 import BeautifulSoup
url = 'https://www.google.com/search?hl=en&q=distance%20between%20zip%20codes%2075000%20paris%20and%2075016%20paris'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
print(soup.find(text=re.compile(r'\d+\.\d+\s*km')))
Prints:
15.8 km
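If you need all the distances rather than only the first match, soup.find_all() with the same pattern is a possible extension (a sketch, assuming the remaining routes appear as similar text nodes in the HTML Google serves):

# Sketch: find_all() returns every matching text node, which should also
# cover the 2nd and 3rd routes (assumption: they are rendered as plain
# text nodes like the first one).
for distance in soup.find_all(text=re.compile(r'\d+\.\d+\s*km')):
    print(distance.strip())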
Upvotes: 0