Reputation: 669
For a school project I need to get the web addresses of 200 companies (based on a list). My script works fine, but when I get to around company 80, I get blocked by Google. This is the message I'm getting:
> Our systems have detected unusual traffic from your computer network.
> This page checks to see if it's really you sending the requests, and
> not a robot.
I tried two different ways to get my data:
A simple one:
import requests
from bs4 import BeautifulSoup

for company_name in data:
    search = company_name
    results = 1
    page = requests.get("https://www.google.com/search?q={}&num={}".format(search, results))
    soup = BeautifulSoup(page.content, "html5lib")
and a more complex one:
import time

import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from bs4 import BeautifulSoup

for company_name in data:
    search = company_name
    results = 1
    s = requests.Session()
    retries = Retry(total=3, backoff_factor=0.5)
    s.mount('http://', HTTPAdapter(max_retries=retries))
    s.mount('https://', HTTPAdapter(max_retries=retries))
    page = s.get("https://www.google.com/search?q={}&num={}".format(search, results))
    #time.sleep(.600)
    soup = BeautifulSoup(page.content, "html5lib")
But I'm getting the same error over and over. Is there a way I can overcome this issue? Thanks!
Upvotes: 4
Views: 8975
Reputation: 138
You could try adjusting your user-agent header to something other than the Python requests default. In doing so I was able to search for all 949 companies from https://www.sec.gov/rules/other/4-460list.htm with no issues.
My default user-agent is:
print requests.utils.default_headers()
{'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'User-Agent': 'python-requests/2.8.1'}
Google might detect this as unusual traffic, so try a browser-like user-agent instead:
import requests
from bs4 import BeautifulSoup

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14'
headers = {'User-Agent': user_agent, 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}

for company_name in data:
    search = company_name
    results = 1
    page = requests.get("https://www.google.com/search?q={}&num={}".format(search, results), headers=headers)
    soup = BeautifulSoup(page.content, "html5lib")
    print company_name
    print soup.find('h3', attrs={'class': 'r'}).text
    print soup.find('h3', attrs={'class': 'r'}).find('a').attrs['href']
Upvotes: 0
Reputation: 365657
If you want to make sure you never make more than one request every 0.6 seconds, you just need to sleep until it's been at least 0.6 seconds since the last request.
If the amount of time it takes you to process each request is a tiny fraction of 0.6 seconds, you can uncomment the line already in your code. However, it probably makes more sense to do it at the end of the loop, rather than in the middle:
for company_name in data:
    # blah blah
    page = s.get("https://www.google.com/search?q={}&num={}".format(search, results))
    soup = BeautifulSoup(page.content, "html5lib")
    # do whatever you wanted with soup
    time.sleep(.600)
If your processing takes a sizable fraction of 0.6 seconds, then waiting 0.6 seconds is too long. For example, if it sometimes takes 0.1 seconds, sometimes 1.0, then you want to wait 0.5 seconds in the first case, but not at all in the second, right?
In that case, just keep track of the last time you made a request, and sleep until 0.6 seconds after that:
last_req = time.time()
for company_name in data:
    # blah blah
    page = s.get("https://www.google.com/search?q={}&num={}".format(search, results))
    soup = BeautifulSoup(page.content, "html5lib")
    # do whatever you wanted with soup
    now = time.time()
    delay = last_req + 0.600 - now
    last_req = now
    if delay >= 0:
        time.sleep(delay)
If you need to make requests exactly once every 0.6 seconds—or as close to that as possible—you could kick off a thread that does that, and tosses the results in a queue, while another thread (possibly your main thread) just blocks popping requests off that queue and processing them.
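If you did want that, a minimal sketch could look like the following, using the standard threading and queue modules (the module is named Queue on Python 2); the search URL mirrors your code, and the parsing step is left as a placeholder:

import queue
import threading
import time

import requests

def fetcher(companies, out_queue, interval=0.6):
    # Producer: fire one request roughly every `interval` seconds and
    # hand each response off to the consumer thread.
    next_time = time.time()
    for company_name in companies:
        wait = next_time - time.time()
        if wait > 0:
            time.sleep(wait)
        page = requests.get(
            "https://www.google.com/search?q={}&num={}".format(company_name, 1))
        out_queue.put((company_name, page.content))
        next_time += interval
    out_queue.put(None)  # sentinel: tells the consumer there is no more work

def processor(in_queue):
    # Consumer: block until a response arrives, then do the slow parsing
    # without delaying the request schedule.
    while True:
        item = in_queue.get()
        if item is None:
            break
        company_name, content = item
        # ... parse `content` with BeautifulSoup here ...

q = queue.Queue()
threading.Thread(target=fetcher, args=(data, q)).start()
processor(q)

The fetcher keeps to its schedule no matter how long the parsing takes, because the slow work happens on the other thread.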
But I can't imagine why you'd need that.
Upvotes: 3