Reputation: 301
I'm setting up a Django view that requests product data from an API, parses it with BeautifulSoup, applies the googletrans module, and saves the result to my PostgreSQL database.
Everything was working fine yesterday until Google suddenly blocked my IP address for making too many requests at once.
I turned on my LTE connection to change my IP address and it worked again.
But to make sure this doesn't happen with the new IP address, I need a way to call the googletrans API in batches, or any other solution that would prevent me from getting blocked again.
This is my view:
from bs4 import BeautifulSoup
from django.shortcuts import render
from googletrans import Translator
import requests
import json

def api_data(request):
    if request.GET.get('mybtn'):  # to improve, == 'something':
        resp_1 = requests.get(
            "https://www.headout.com/api/public/v1/product/listing/list-by/city?language=fr&cityCode=PARIS&limit=5000&currencyCode=CAD",
            headers={
                "Headout-Auth": HEADOUT_PRODUCTION_API_KEY
            })
        resp_1_data = resp_1.json()
        base_url_2 = "https://www.headout.com/api/public/v1/product/get/"
        translator = Translator()

        for item in resp_1_data['items']:
            print('translating item {}'.format(item['id']))
            # concat ID to the URL string
            url = '{}{}?language=fr'.format(base_url_2, item['id'])
            # make the HTTP request
            resp_2 = requests.get(
                url,
                headers={
                    "Headout-Auth": HEADOUT_PRODUCTION_API_KEY
                })
            resp_2_data = resp_2.json()
            descriptiontxt = resp_2_data['contentListHtml'][0]['html'][0:2040] + ' ...'

            # Parsing work
            soup = BeautifulSoup(descriptiontxt, 'lxml')
            parsed = soup.find('p').text

            # Translation doesn't work
            translation = translator.translate(parsed, dest='fr')
            titlename = item['name']
            titlefr = translator.translate(titlename, dest='fr')
            destinationname = item['city']['name']
            destinationfr = translator.translate(destinationname, dest='fr')

            Product.objects.get_or_create(
                title=titlefr.text,
                destination=destinationfr.text,
                description=translation.text,
                link=item['canonicalUrl'],
                image=item['image']['url']
            )
    return render(request, "form.html")
How can I call the Google translation API in batches? Or is there another way to avoid being blocked?
Please help.
EDIT
Based on @ddor254's answer, where should I put the time.sleep(2) call?
This is what I came up with. Is this okay?
Product.objects.get_or_create(
    title=titlefr.text,
    destination=destinationfr.text,
    description=translation.text,
    link=item['canonicalUrl'],
    image=item['image']['url']
)
time.sleep(2)  # here
Or like this:
resp_1 = requests.get(
    "https://www.headout.com/api/public/v1/product/listing/list-by/city?language=fr&cityCode=PARIS&limit=5000&currencyCode=CAD",
    headers={
        "Headout-Auth": HEADOUT_PRODUCTION_API_KEY
    }, time.sleep(2))  # here
I just want to make sure this is the right way to do it before I risk getting this new IP blocked as well.
Upvotes: 3
Views: 10972
Reputation: 1
My IP was blocked after ~450 concurrent connections. I am using a PHP for loop to translate my text array.
So I changed my IP address and changed my code to pause at fixed intervals.
My code inside the for loop ($i is the loop counter):
if ($i % 100 == 0 && $i != 0) {
    // wait 60 seconds every 100
    usleep(60000000); // 60 seconds
    echo str_pad("XX--> WAITING 60 SECONDS<br>", 4096);
} else if ($i % 10 == 0 && $i != 0) {
    // wait 15 seconds every 10
    usleep(15000000); // 15 seconds
    echo str_pad("XX--> WAITING 15 SECONDS<br>", 4096);
} else if ($i % 2 == 0 && $i != 0) {
    // wait 2 seconds every 2
    usleep(2000000); // 2 seconds
    echo str_pad("XX--> WAITING 2 SECONDS<br>", 4096);
}
Upvotes: 0
Reputation: 848
I have been blocked too because of too many concurrent requests. I usually get blocked after about 500 concurrent requests. What I did was add a 60-second pause after every 100 requests. It may seem long, but it works. A 45-second pause would probably also work, but I set it to 60 just to make sure.
Here's an example:
import time
from googletrans import Translator

translator = Translator()

class GoogleAPI:
    def __init__(self):
        self.limit_before_timeout = 100  # requests allowed before pausing
        self.timeout = 60  # pause length in seconds

    def translate(self, source):
        translation = translator.translate(source, dest="ar").text
        if translation:
            return translation

    def process(self, list_of_data):
        print("initiation")
        results = []
        for i, t in enumerate(list_of_data, start=1):
            results.append(self.translate(t))
            # Pause after every full batch so no item is skipped
            if i % self.limit_before_timeout == 0:
                print("100 words added")
                time.sleep(self.timeout)
        print("All done")
        return results
Upvotes: 1
Reputation: 171
Try adding delays between consecutive queries (using sleep) and play with the numbers to see what works for you. A 2s delay after every pair of translations and a 15s delay after every 10 works fine for me.
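A minimal Python sketch of that schedule, in case it helps. The names pause_after and throttled are made up for illustration, and the 2s/15s values are just the numbers quoted above, not limits documented by Google:

```python
import time

def pause_after(count):
    """Return how many seconds to wait after the count-th translation."""
    if count % 10 == 0:
        return 15  # longer pause after every 10th call
    if count % 2 == 0:
        return 2   # short pause after every pair of calls
    return 0

def throttled(items, work, sleep=time.sleep):
    """Call work(item) for each item, pausing according to pause_after().

    `sleep` is injectable so the schedule can be checked without waiting.
    """
    results = []
    for count, item in enumerate(items, start=1):
        results.append(work(item))
        delay = pause_after(count)
        if delay:
            sleep(delay)
    return results
```

With the view from the question, this could be called as throttled(texts, lambda t: translator.translate(t, dest='fr').text), where texts is whatever list of strings you want translated.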
Upvotes: 2
Reputation: 1628
I suggest you read this article from MDN: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429
If this is the response you get, look at the Retry-After header in the response object.
Adding a sleep or another delay method, using the value of that header, might fix your problem.
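As a rough sketch of reading that header: Retry-After can hold either a number of seconds or an HTTP date, so both forms need handling. The function name retry_after_seconds and the 60-second fallback are assumptions for illustration, and whether googletrans exposes the underlying response at all is something you would have to check:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_after_seconds(value, default=60.0, now=None):
    """Convert a Retry-After header value into a non-negative delay in seconds."""
    if value is None:
        return default  # header missing: fall back to an assumed default
    value = value.strip()
    if value.isdigit():
        return float(value)  # delay-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default  # unparseable value
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

With a requests response this would be used roughly as: if resp.status_code == 429, call time.sleep(retry_after_seconds(resp.headers.get("Retry-After"))) and then retry the request.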
Upvotes: 1