Reputation: 669
For a school project I need to get the web addresses of 200 companies (based on a list). My script works fine, but when I get to around company 80, I get blocked by Google. This is the message I'm getting:
> Our systems have detected unusual traffic from your computer network.
> This page checks to see if it's really you sending the requests, and
> not a robot.
I tried two different ways to get my data:
A simple one:
import requests
from bs4 import BeautifulSoup

for company_name in data:
    search = company_name
    results = 1
    page = requests.get("https://www.google.com/search?q={}&num={}".format(search, results))
    soup = BeautifulSoup(page.content, "html5lib")
and a more complex one:
import time

import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from bs4 import BeautifulSoup

for company_name in data:
    search = company_name
    results = 1
    s = requests.Session()
    retries = Retry(total=3, backoff_factor=0.5)
    s.mount('http://', HTTPAdapter(max_retries=retries))
    s.mount('https://', HTTPAdapter(max_retries=retries))
    page = s.get("https://www.google.com/search?q={}&num={}".format(search, results))
    #time.sleep(.600)
    soup = BeautifulSoup(page.content, "html5lib")
But I'm getting the same error over and over. Is there a way I can overcome this issue? Thanks!
Upvotes: 4
Views: 8975
Reputation: 138
You could try adjusting your user-agent header to something other than the Python requests default. In doing so I was able to search for all 949 companies from https://www.sec.gov/rules/other/4-460list.htm with no issues.
My default user-agent is:
print requests.utils.default_headers()
{'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'User-Agent': 'python-requests/2.8.1'}
Google might detect this as unusual traffic, so try a browser-like user-agent instead:
import requests
from bs4 import BeautifulSoup

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14'
headers = {'User-Agent': user_agent, 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}

for company_name in data:
    search = company_name
    results = 1
    page = requests.get("https://www.google.com/search?q={}&num={}".format(search, results), headers=headers)
    soup = BeautifulSoup(page.content, "html5lib")
    print company_name
    print soup.find('h3', attrs={'class': 'r'}).text
    print soup.find('h3', attrs={'class': 'r'}).find('a').attrs['href']
Upvotes: 0
Reputation: 365657
If you want to make sure you never make more than one request every 0.6 seconds, you just need to sleep until it's been at least 0.6 seconds since the last request.
If the amount of time it takes you to process each request is a tiny fraction of 0.6 seconds, you can uncomment the line already in your code. However, it probably makes more sense to do it at the end of the loop, rather than in the middle:
for company_name in data:
    # blah blah
    page = s.get("https://www.google.com/search?q={}&num={}".format(search, results))
    soup = BeautifulSoup(page.content, "html5lib")
    # do whatever you wanted with soup
    time.sleep(.600)
If your processing takes a sizable fraction of 0.6 seconds, then waiting 0.6 seconds is too long. For example, if it sometimes takes 0.1 seconds, sometimes 1.0, then you want to wait 0.5 seconds in the first case, but not at all in the second, right?
In that case, just keep track of the last time you made a request, and sleep until 0.6 seconds after that:
last_req = time.time()
for company_name in data:
    # blah blah
    page = s.get("https://www.google.com/search?q={}&num={}".format(search, results))
    soup = BeautifulSoup(page.content, "html5lib")
    # do whatever you wanted with soup
    now = time.time()
    delay = last_req + 0.600 - now
    last_req = now
    if delay >= 0:
        time.sleep(delay)
If you need to make requests exactly once every 0.6 seconds—or as close to that as possible—you could kick off a thread that does that, and tosses the results in a queue, while another thread (possibly your main thread) just blocks popping requests off that queue and processing them.
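If you did want that, a minimal sketch could look like the following, using the standard threading and queue modules (the module is named Queue on Python 2); the search URL mirrors your code, and the parsing step is left as a placeholder:

import queue
import threading
import time

import requests

def fetcher(companies, out_queue, interval=0.6):
    # Producer: fire one request roughly every `interval` seconds and
    # hand each response off to the consumer thread.
    next_time = time.time()
    for company_name in companies:
        wait = next_time - time.time()
        if wait > 0:
            time.sleep(wait)
        page = requests.get(
            "https://www.google.com/search?q={}&num={}".format(company_name, 1))
        out_queue.put((company_name, page.content))
        next_time += interval
    out_queue.put(None)  # sentinel: tells the consumer there is no more work

def processor(in_queue):
    # Consumer: block until a response arrives, then do the slow parsing
    # without delaying the request schedule.
    while True:
        item = in_queue.get()
        if item is None:
            break
        company_name, content = item
        # ... parse `content` with BeautifulSoup here ...

q = queue.Queue()
threading.Thread(target=fetcher, args=(data, q)).start()
processor(q)

The fetcher keeps to its schedule no matter how long the parsing takes, because the slow work happens on the other thread.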
But I can't imagine why you'd need that.
Upvotes: 3