Reputation: 72

Getting blocked trying to scrape google search results

I am using Python and BeautifulSoup to scrape Google search results, but I run into CAPTCHAs as soon as I make more than 10 requests.

I tried the Python requests library with a custom user agent, proxies, sleeps between requests, verify=False, and everything else I could think of to get rid of the CAPTCHAs, but they keep coming back.

I tried Selenium WebDriver (headless), but to no avail.

I tried making cURL requests from Python. That lasts longer than requests and Selenium, but it eventually gets blocked too.

I just want to scrape Google search results peacefully and anonymously. Any advice, please?

Upvotes: 2

Answers (2)

miroku47

Reputation: 239

HTTP header information is often used by sites that incorporate anti-bot technology to flag users as potential bots or crawlers. In other words, you need to make sure that your header information, which is part of your overall browser fingerprint, does not give you away as a bot or crawler. For a site like Google, it might be worth going with a more advanced off-the-shelf scraper to save a lot of hassle. These tools take care of proxy rotation, browser fingerprinting, and header information to prevent blocks, and some of them also incorporate a SERP API to fetch search engine results data.
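As a rough illustration of keeping header information consistent, here is a minimal sketch. It is not a guaranteed way around Google's detection. The helper name build_headers, the BROWSER_PROFILES table, and the exact Accept values are illustrative assumptions, not part of any library. The idea is only that the User-Agent and the other headers should plausibly belong to the same browser.

```python
# Hypothetical helper: the names and header values below are illustrative
# assumptions, not values taken from a library or from Google.
BROWSER_PROFILES = {
    "firefox": {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    },
    "chrome": {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    },
}


def build_headers(profile, language="en-US,en;q=0.9"):
    """Return a header dict whose fields all match one browser profile."""
    headers = dict(BROWSER_PROFILES[profile])  # copy, so the table is not mutated
    headers["Accept-Language"] = language
    headers["Accept-Encoding"] = "gzip, deflate"
    return headers


headers = build_headers("firefox")
```

The resulting dict can be passed as headers=headers to requests.get, so that the User-Agent is not the only browser-like header on the request.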

Upvotes: 2

Denis Skopa

Reputation: 99

If you are making a large number of requests to a website, it's a good idea to vary each request by sending a different set of HTTP headers (user-agent rotation), so that the requests appear to come from different computers and browsers:

import requests, random

user_agent_list = [
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
]

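for _ in user_agent_list:
  # Pick a random user agent for this request
  user_agent = random.choice(user_agent_list)

  # Set the headers
  headers = {'User-Agent': user_agent}

  # Send the request inside the loop so that each request uses the headers just picked
  response = requests.get('URL', headers=headers, timeout=10)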

In addition to rotating user agents, you can rotate proxies (ideally residential), which can be combined with a CAPTCHA solver to bypass CAPTCHAs.
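A minimal sketch of proxy rotation, assuming you already have a list of proxy endpoints from a provider. The proxy URLs and the helper name rotating_proxies are hypothetical. The only library detail relied on is that requests accepts a proxies mapping keyed by "http" and "https".

```python
import itertools

# Hypothetical proxy endpoints: replace with real ones from your provider.
PROXY_URLS = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
    "http://proxy3.example.com:8000",
]


def rotating_proxies(proxy_urls):
    """Yield requests-style proxy mappings, cycling through proxy_urls forever."""
    for url in itertools.cycle(proxy_urls):
        yield {"http": url, "https": url}


proxy_pool = rotating_proxies(PROXY_URLS)
first_proxy = next(proxy_pool)  # mapping for the first request
```

Each request would then take the next mapping, for example requests.get(url, headers=headers, proxies=next(proxy_pool), timeout=10). Round-robin is just the simplest policy; picking a proxy at random or dropping proxies that start returning CAPTCHAs are equally reasonable choices.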

If nothing else works, you can use a third-party alternative such as SerpApi's Google Search Engine Results API. It's a paid API with a free plan.

It bypasses blocks (including CAPTCHAs) from Google and other search engines, and you don't need to create and maintain a parser.

This block of code shows how to collect data from all pages (example in the online IDE):

from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json, os

params = {
  "api_key": os.getenv("API_KEY"), # serpapi key
  "engine": "google",              # serpapi parser engine
  "q": "tesla",                    # search query
  "num": "100"                     # number of results per page (100 per page in this case)
  # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)      # where data extraction happens

organic_results_data = []
page_num = 0

while True:
    results = search.get_dict()    # JSON -> Python dictionary

    page_num += 1

    # .get() avoids a KeyError when a page has no organic results
    for result in results.get("organic_results", []):
        organic_results_data.append({
            "page_num": page_num,
            "title": result.get("title"),
            "link": result.get("link"),
            "displayed_link": result.get("displayed_link"),
        })

    # Follow the pagination link until there are no more pages
    next_link = results.get("serpapi_pagination", {}).get("next_link")
    if not next_link:
        break
    search.params_dict.update(dict(parse_qsl(urlsplit(next_link).query)))
    
print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))

Example output:

[
  {
    "page_num": 1,
    "title": "Tesla the Band | Official Website | American Made Rock 'n' ...",
    "link": "https://teslatheband.com/",
    "displayed_link": "https://teslatheband.com"
  },
  {
    "page_num": 1,
    "title": "TSLA: Tesla Inc - Stock Price, Quote and News - CNBC",
    "link": "https://www.cnbc.com/quotes/TSLA",
    "displayed_link": "https://www.cnbc.com › quotes › TSLA"
  },
  {
    "page_num": 1,
    "title": "Tesla, Inc. (TSLA) Stock Price, News, Quote & History",
    "link": "https://finance.yahoo.com/quote/TSLA/",
    "displayed_link": "https://finance.yahoo.com › quote › TSLA"
  },
  # ...
]
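The pagination step in the code above is the densest part: it takes the query string of next_link and turns it into a dict of parameters for the next request. A standard-library-only sketch of that conversion follows. The URL is made up for illustration, since the real shape of next_link is not shown here.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical URL, invented for illustration only.
next_link = "https://example.com/search?engine=google&q=tesla&start=100&num=100"

# urlsplit isolates the query string; parse_qsl splits it into (key, value) pairs.
next_params = dict(parse_qsl(urlsplit(next_link).query))

print(next_params)
# {'engine': 'google', 'q': 'tesla', 'start': '100', 'num': '100'}
```

All values come back as strings, which is why the result can be merged directly into the existing search parameters.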

Disclaimer: I work for SerpApi.

Upvotes: 5
