Reputation: 4687
In my subclass of RequestHandler, I am trying to fetch a range of URLs:
class GetStats(webapp2.RequestHandler):
    def post(self):
        lastpage = 50
        for page in range(1, lastpage):
            tmpurl = url + str(page)
            response = urllib2.urlopen(tmpurl, timeout=5)
            html = response.read()
            # some parsing of the html
            heap.append(result_of_parsing)
        self.response.write(heap)
It works with ~30 URLs (the page takes a long time to load, but it works). With more than 30 I get an error:
Error: Server Error
The server encountered an error and could not complete your request.
Please try again in 30 seconds.
Is there any way to fetch a lot of URLs, maybe something more optimal? Up to several hundred pages?
Update:
I am using BeautifulSoup to parse every page. I found this traceback in the GAE logs:
Traceback (most recent call last):
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 267, in Handle
result = handler(dict(self._environ), self._StartResponse)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1529, in __call__
rv = self.router.dispatch(request, response)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1278, in default_dispatcher
return route.handler_adapter(request, response)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1102, in __call__
return handler.dispatch()
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 570, in dispatch
return method(*args, **kwargs)
File "/base/data/home/apps/s~gae/1.379703839015039430/main.py", line 68, in post
heap = get_times(tmp_url, 160)
File "/base/data/home/apps/s~gae/1.379703839015039430/main.py", line 106, in get_times
soup = BeautifulSoup(html)
File "libs/bs4/__init__.py", line 168, in __init__
self._feed()
File "libs/bs4/__init__.py", line 181, in _feed
self.builder.feed(self.markup)
File "libs/bs4/builder/_htmlparser.py", line 56, in feed
super(HTMLParserTreeBuilder, self).feed(markup)
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/HTMLParser.py", line 114, in feed
self.goahead(0)
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/HTMLParser.py", line 155, in goahead
startswith = rawdata.startswith
DeadlineExceededError
Upvotes: 0
Views: 330
Reputation: 9116
It's failing because you only have 60 seconds to return a response to the user, and I'm going to guess it's taking longer than that.
You will want to use this: https://cloud.google.com/appengine/articles/deferred
to create a task that has a 10-minute timeout. Then you can return instantly to the user, and they can "pick up" the results at a later time via another handler (that you create). If collecting all the URLs takes longer than 10 minutes, you'll have to split them up into further tasks.
See this: https://cloud.google.com/appengine/articles/deadlineexceedederrors
to understand why you cannot go longer than 60 seconds.
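A rough sketch of what that could look like, assuming the deferred builtin is enabled in app.yaml and your BeautifulSoup parsing goes where the comment is; names like fetch_stats, GetResults and the memcache key prefix are only illustrative, not part of your code or the App Engine API:

import urllib2
import webapp2
from google.appengine.api import memcache
from google.appengine.ext import deferred

JOB_PREFIX = 'stats-job-'                   # illustrative key prefix
url = 'http://example.com/stats?page='      # assumed base URL

def fetch_stats(base_url, first_page, last_page, job_id):
    # Runs inside a task queue task, so it gets a 10-minute deadline
    # instead of the 60 seconds a normal request gets.
    heap = []
    for page in range(first_page, last_page):
        response = urllib2.urlopen(base_url + str(page), timeout=5)
        html = response.read()
        # ... parse html with BeautifulSoup here ...
        heap.append(html[:100])             # placeholder for the parsed result
    # Stash the result somewhere the "pick up" handler can find it
    # (the datastore would be more durable than memcache).
    memcache.set(JOB_PREFIX + job_id, heap)

class GetStats(webapp2.RequestHandler):
    def post(self):
        job_id = 'some-unique-id'           # e.g. a uuid per request
        deferred.defer(fetch_stats, url, 1, 160, job_id)
        self.response.write('started job ' + job_id)   # returns immediately

class GetResults(webapp2.RequestHandler):
    def get(self):
        result = memcache.get(JOB_PREFIX + self.request.get('job_id'))
        self.response.write(str(result) if result is not None else 'not ready yet')

If the whole range does not fit in the 10-minute task deadline, fetch_stats can defer another call to itself for the next chunk of pages instead of looping over all of them at once.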
Upvotes: 5
Reputation: 1913
Edit: This might come from App Engine quotas and limits. Sorry for the previous answer:
This looks like server-side protection against DDoS or scraping from a single client, so you have a few options:
Wait after a certain number of requests before continuing (see the sketch at the end of this answer).
Make requests from several clients with different IP addresses and send the information back to your main script (renting different servers for this might be costly).
Check whether the website has an API that gives access to the data you need.
Also take care: the site owner could block or blacklist your IP if they decide your requests are abusive.
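A minimal sketch of the "wait between requests" option, assuming a hypothetical fetch_page() helper wrapping the existing urllib2 call; the batch size and pause are arbitrary and would need tuning against the site's limits:

import time
import urllib2

def fetch_page(page_url):
    # Hypothetical helper around the existing urllib2 call.
    return urllib2.urlopen(page_url, timeout=5).read()

def fetch_politely(base_url, last_page, batch_size=10, pause_seconds=5):
    pages = []
    for page in range(1, last_page):
        pages.append(fetch_page(base_url + str(page)))
        if page % batch_size == 0:
            time.sleep(pause_seconds)   # pause so the site does not see one long burst
    return pages

Note that sleeping inside a normal App Engine request only makes the 60-second deadline easier to hit, so this only helps if the fetching already runs inside a task, as the other answer describes.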
Upvotes: 0