ganjaam

Reputation: 1286

Scrape Google snippet text in a certain language (English) using Python and Selenium

I am trying to scrape the snippet text from a Google search results page, and this solution worked well. The only remaining issue is that the text comes back in Bangla, while I want it in English.

Here's what I've tried:

options = webdriver.ChromeOptions()
options.add_argument('lang=en')

driver = webdriver.Chrome(executable_path=r'the\path\for\chromedriver.exe', options=options)

I tried adding 'lang=en' as an argument to ChromeOptions and passing the options to webdriver.Chrome(). That's all I could figure out, but it's not working.

Here's the full code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_experimental_option('prefs', {'intl.accept_languages': 'en,en_US'})
options.add_argument('lang=en')

driver = webdriver.Chrome(executable_path=r'C:\Users\camoh\AppData\Local\Programs\Python\Python38\chromedriver.exe', options=options)

driver.get('https://google.com/')
assert "Google" in driver.title
#wait = WebDriverWait(driver, 20)
#wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".gLFyf.gsfi")))
input_field = driver.find_element_by_css_selector(".gLFyf.gsfi")
input_field.send_keys("when barack obama born")
input_field.send_keys(Keys.RETURN)

#wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".Z0LcW.XcVN5d")))
result = driver.find_element_by_css_selector(".Z0LcW.XcVN5d").text
print(result)
driver.close()
driver.quit()

Here's the page when I run the code:

[Image: Selenium output]

Upvotes: 1

Views: 689

Answers (3)

Denis Skopa

Reputation: 99

To scrape the Google Search answer box, you don't need Selenium. You can extract it without a browser, using the requests library together with the BeautifulSoup web scraping library.

For example, with the requests library you can pass URL parameters such as hl, gl, or location to set the language, the country of the search, and the location, respectively:

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    # other parameters
    "hl": "en",                             # language, english, https://serpapi.com/google-languages
    "gl": "us",                             # country of the search, US -> USA, https://serpapi.com/google-countries
}
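
As a rough sketch of what these parameters do, the snippet below builds the final search URL by hand with the standard library. The helper name build_search_url is hypothetical and only for illustration. Assuming Google honors hl in the query string the same way for a real browser, the resulting URL could presumably also be passed to driver.get() in the Selenium script from the question instead of typing the query into the search box.

```python
from urllib.parse import urlencode


def build_search_url(query, hl="en", gl="us"):
    """Return a Google search URL with an explicit language (hl) and country (gl).

    build_search_url is a hypothetical helper name, not part of any library.
    """
    params = {"q": query, "hl": hl, "gl": gl}
    # urlencode escapes the values and joins them as key=value pairs.
    return "https://www.google.com/search?" + urlencode(params)


url = build_search_url("when barack obama born")
print(url)
# https://www.google.com/search?q=when+barack+obama+born&hl=en&gl=us
```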

Check code in the online IDE.

from bs4 import BeautifulSoup
import requests  # the 'lxml' parser used below must be installed, but needs no import

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "when barack obama born",          # query example
    "hl": "en",                             # language, english, https://serpapi.com/google-languages
    "gl": "us",                             # country of the search, US -> USA, https://serpapi.com/google-countries
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}

html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, 'lxml')

title = soup.select_one(".i29hTd").text
date = soup.select_one(".t2b5Cf").text
age = soup.select_one(".kZ91ed").text

print(title, date, age)

Output:

Barack Obama/Date of birth August 4, 1961

If you change the parameters "hl": "en", "gl": "us" to "hl": "de", "gl": "de", you get the output in German:

Barack Obama/Geburtsdatum 4. August 1961 Alter 61 Jahre

Alternatively, you can use the Google Search Engine Results API from SerpApi. It's a paid API with a free plan. The difference is that it bypasses blocks from Google (including CAPTCHA), so there is no need to create and maintain a parser.

Code example:

from serpapi import GoogleSearch
import os

params = {
  "engine": "google",              # SerpApi search engine
  "q": "when barack obama born",   # query(answer)
  "api_key": "...",                # serpapi key, https://serpapi.com/manage-api-key 
  "hl": "en",                      # language
  "gl": "us"                       # country of the search, US -> USA
}

search = GoogleSearch(params)      # where data extraction happens
results = search.get_dict()        # JSON -> Python dictionary
answer_box = results["answer_box"]

title = answer_box.get("title")
answer = answer_box.get("answer")

print(title, answer)

Output:

Barack Obama/Date of birth August 4, 1961
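
Not every query produces an answer box, in which case results["answer_box"] would raise a KeyError. A minimal defensive sketch follows. The helper name extract_answer is hypothetical, and the sample dictionary is a hand-written stand-in for search.get_dict() whose values mirror the output above (how the text splits between "title" and "answer" is an assumption).

```python
def extract_answer(results):
    """Return (title, answer) from a results dict, or (None, None) if absent.

    extract_answer is a hypothetical helper, not part of the serpapi package.
    """
    # Fall back to an empty dict when the "answer_box" key is missing or None.
    answer_box = results.get("answer_box") or {}
    return answer_box.get("title"), answer_box.get("answer")


# Hand-written stand-in for search.get_dict(); no API call is made here.
sample = {
    "answer_box": {
        "title": "Barack Obama/Date of birth",
        "answer": "August 4, 1961",
    }
}

print(extract_answer(sample))  # ('Barack Obama/Date of birth', 'August 4, 1961')
print(extract_answer({}))      # (None, None)
```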

Upvotes: 1

Bhavya Parikh

Reputation: 3400

You can try the code below, which adds an argument with the preferred language:

from selenium import webdriver

options = webdriver.ChromeOptions()  # create the ChromeOptions object
options.add_argument("--lang=en")  # note the leading "--" on the switch
# or, alternatively:
# options.add_argument("--lang=en-US")

Upvotes: 2

Swaroop Humane

Reputation: 1836

Use:

options.add_experimental_option('prefs', {'intl.accept_languages': 'en,en_US'})
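
One way to keep the language settings from this thread consistent is to derive both the --lang switch and the intl.accept_languages preference from a single language tag. This is only a sketch: chrome_language_settings is a hypothetical helper, and the hyphenated form (en-US rather than en_US) is an assumption based on the standard language-tag format. The Selenium calls are left as comments because they need a local browser and driver.

```python
def chrome_language_settings(lang="en", region="US"):
    """Return a --lang switch and a prefs dict built from one language tag.

    chrome_language_settings is a hypothetical helper for illustration only.
    """
    tag = f"{lang}-{region}"  # e.g. "en-US" (hyphenated form is an assumption)
    return {
        "argument": f"--lang={tag}",
        "prefs": {"intl.accept_languages": f"{tag},{lang}"},
    }


settings = chrome_language_settings()
print(settings["argument"])  # --lang=en-US
print(settings["prefs"])     # {'intl.accept_languages': 'en-US,en'}

# With Selenium installed, the values could be applied like this (not run here):
#   options = webdriver.ChromeOptions()
#   options.add_argument(settings["argument"])
#   options.add_experimental_option("prefs", settings["prefs"])
```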

Upvotes: 1
