Macaulay

Reputation: 79

How to detect captchas when scraping Google?

I'm using the requests package with BeautifulSoup to scrape Google News for the number of search results for a query. I'm getting two types of IndexError, which I want to distinguish between:

  1. When there are no search results. Here selecting #resultStats returns an empty list, [], so indexing it fails. What seems to be going on is that when a query string is too long, Google doesn't even say "0 search results"; it just doesn't say anything.
  2. When Google gives me a captcha instead of a results page.

I need to distinguish between these cases, because I want my scraper to wait five minutes when Google sends me a captcha, but not when the results are simply empty.

I currently have a jury-rigged approach, where I send another query with a known nonzero number of search results, which allows me to distinguish between the two IndexErrors. I'm wondering if there's a more elegant and direct approach to doing this, using BeautifulSoup.

Here's my code:

import requests, bs4, lxml, re, time, random
import pandas as pd
import numpy as np

URL = 'https://www.google.com/search?tbm=nws&q={query}&tbs=cdr%3A1%2Ccd_min%3A{year}%2Ccd_max%3A{year}&authuser=0'
headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
}

def tester(): # test for captcha
    test = requests.get('https://www.google.ca/search?q=donald+trump&safe=off&client=ubuntu&espv=2&biw=1910&bih=969&source=lnt&tbs=cdr%3A1%2Ccd_min%3A2016%2Ccd_max%3A&tbm=nws', headers=headers)
    dump = bs4.BeautifulSoup(test.text,"lxml")
    result = dump.select('#resultStats')
    num = result[0].getText()
    num = re.search(r"\b\d[\d,.]*\b",num).group() # regex
    num = int(num.replace(',',''))
    num = (num > 0)
    return num

def search(**params):
    response = requests.get(URL.format(**params),headers=headers)
    print(response.content, response.status_code) # check this for google requiring Captcha
    soup = bs4.BeautifulSoup(response.text,"lxml")
    elems = soup.select('#resultStats')

    try: # want code to flag if I get a Captcha
        hits = elems[0].getText()
        hits = re.search(r"\b\d[\d,.]*\b",hits).group() # regex
        hits = int(hits.replace(',',''))
        print(hits)    
        return hits
    except IndexError:
        try:
            tester() # if captcha, this will raise another IndexError
            print("Empty results!")
            hits = 0
            return hits
        except IndexError:
            print("Captcha'd!")
            time.sleep(300) # wait five minutes; should make it rotate IP when captcha'd
            hits = 0
            return hits

for qry in queries: # queries: my list of search strings, defined elsewhere
    hits = search(query=qry, year=2016)

Upvotes: 2

Views: 4860

Answers (1)

alecxe

Reputation: 473763

I'd just search for the captcha element. For example, if this is Google reCAPTCHA, you can look for the hidden input that contains the token:

is_captcha_on_page = soup.find("input", id="recaptcha-token") is not None
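
To separate the two cases from the question without sending a second test query, the same check can run before the #resultStats lookup. The following is only a minimal sketch: `classify_page` is a hypothetical helper name, the `recaptcha-token` id is an assumption that the block page actually contains that input, and it uses the built-in "html.parser" so it works even without lxml installed.

```python
import re

from bs4 import BeautifulSoup


def classify_page(html):
    """Return ("captcha", None), ("empty", 0) or ("results", n) for a page.

    Hypothetical helper: the element ids are assumptions about the markup.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Check for the captcha marker first, since a captcha page
    # also lacks the #resultStats element.
    if soup.find("input", id="recaptcha-token") is not None:
        return "captcha", None

    stats = soup.find(id="resultStats")
    if stats is None:
        # No captcha and no stats element: treat as an empty result.
        return "empty", 0

    match = re.search(r"\d[\d,]*", stats.get_text())
    if match is None:
        return "empty", 0
    return "results", int(match.group().replace(",", ""))
```

In `search`, you could then call `classify_page(response.text)` and sleep only when the first item is "captcha", which removes the need for the nested try/except and the extra request in `tester`.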

Upvotes: 3
