Reputation: 15
I want to scrape all the distances in the given Google result image. I was able to scrape the first distance, but I am not able to scrape the 2nd and 3rd distances. I am using the code below to scrape the first distance.
from urllib.parse import quote

import requests
from bs4 import BeautifulSoup

qstr = quote("distance between zip codes 75000 paris and 75016 paris")
url_getallfolders = 'https://www.google.com/search?q=' + qstr
response = requests.get(url_getallfolders)
soup = BeautifulSoup(response.content, 'lxml')
# print(response.text)
tagc = soup.select("div.kCrYT span")
codes = [i.text.strip() for i in tagc]
print(codes)
Upvotes: 0
Views: 246
Reputation: 1734
| Search query | Result |
|---|---|
| distance between zip codes 75000 paris and 75016 paris | zero relevance results |
| distance between zip 75000 paris and zip 75016 paris | desired results |
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "distance between zip 75000 paris and zip 75016 paris",
    "hl": "en",
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
# https://www.whatismybrowser.com/detect/what-is-my-user-agent
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4758.87 Safari/537.36",
}

html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

for result in soup.select(".uE1RRc"):
    print(result.text)
"""
38 min (15.8 km) via Bd Périphérique
38 min (11.1 km) via Av. de New York
44 min (12.4 km) via Bd Haussmann and Bd Périphérique
"""
Alternatively, you can achieve the same result with the Google Answer Box API from SerpApi. It's a paid API with a free plan.
The main difference is that you don't have to figure out how to parse the data, bypass blocks from Google, or maintain the parser over time.
Example code to integrate:
from serpapi import GoogleSearch
import os, json
# https://docs.python.org/3/library/os.html#os.getenv
params = {
    "api_key": os.getenv("API_KEY"),  # your SerpApi API key
    "engine": "google",               # search engine
    "q": "what distance between zip 75000 paris and zip 75016 paris",  # query
    "hl": "en"                        # language
    # other search parameters
}
search = GoogleSearch(params) # where data extraction happens
results = search.get_dict() # JSON -> Python dictionary
routes = results["answer_box"]["routes"]
print(json.dumps(routes, indent=2, ensure_ascii=False))
Output from three routes:
[
  {
    "summary": "48 min (11.1 km) via Av. de New York",
    "formatted": {
      "duration": "48 min",
      "distance": "11.1 km",
      "via": "Av. de New York"
    },
    "link": "https://www.google.com/maps/dir/75000+Paris,+France/75016+Paris,+France/data=!4m8!4m7!1m2!1m1!1s0x47e66e74623cb693:0x10389ef77ae91296!1m2!1m1!1s0x47e67ab45134ecd9:0x1c0b82c6e1d851f0!3e0?sa=X&hl=en"
  },
  {
    "summary": "50 min (15.8 km) via Bd Périphérique",
    "formatted": {
      "duration": "50 min",
      "distance": "15.8 km",
      "via": "Bd Périphérique"
    },
    "link": "https://www.google.com/maps/dir/75000+Paris,+France/75016+Paris,+France/data=!4m8!4m7!1m2!1m1!1s0x47e66e74623cb693:0x10389ef77ae91296!1m2!1m1!1s0x47e67ab45134ecd9:0x1c0b82c6e1d851f0!3e0?sa=X&hl=en"
  },
  {
    "summary": "52 min (12.4 km) via Bd Haussmann and Bd Périphérique",
    "formatted": {
      "duration": "52 min",
      "distance": "12.4 km",
      "via": "Bd Haussmann and Bd Périphérique"
    },
    "link": "https://www.google.com/maps/dir/75000+Paris,+France/75016+Paris,+France/data=!4m8!4m7!1m2!1m1!1s0x47e66e74623cb693:0x10389ef77ae91296!1m2!1m1!1s0x47e67ab45134ecd9:0x1c0b82c6e1d851f0!3e0?sa=X&hl=en"
  }
]
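If you only need the distances themselves, they can be picked out of the routes list with a list comprehension (a sketch based on the structure shown in the output above):

# Sketch: collect just the distance strings from the routes shown above.
distances = [route["formatted"]["distance"] for route in routes]
print(distances)  # ['11.1 km', '15.8 km', '12.4 km']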
Disclaimer: I work for SerpApi.
Upvotes: 1
Reputation: 195613
You can use a regex pattern in soup.find() to find the distance (also set the User-Agent HTTP header).
For example:
import re
import requests
from bs4 import BeautifulSoup
url = 'https://www.google.com/search?hl=en&q=distance%20between%20zip%20codes%2075000%20paris%20and%2075016%20paris'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
print(soup.find(text=re.compile(r'\d+\.\d+\s*km')))
Prints:
15.8 km
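If you need all the distances rather than only the first match, soup.find_all() with the same pattern is a possible extension (a sketch, assuming the remaining routes appear as similar text nodes in the HTML Google serves):

# Sketch: find_all() returns every matching text node, which should also
# cover the 2nd and 3rd routes (assumption: they are rendered as plain
# text nodes like the first one).
for distance in soup.find_all(text=re.compile(r'\d+\.\d+\s*km')):
    print(distance.strip())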
Upvotes: 0