Reputation: 23
I'm currently ingesting data through an API that returns close to 100,000 documents in a paginated fashion (100 per page). I have some code that roughly functions as follows:
import urllib2
import simplejson

while c <= limit:
    if not api_url:
        break
    req = urllib2.Request(api_url)
    opener = urllib2.build_opener()
    f = opener.open(req)
    response = simplejson.load(f)
    for item in response['documents']:
        # DO SOMETHING HERE
        pass
    if 'more_url' in response:
        api_url = response['more_url']
    else:
        api_url = None
        break
    c += 1
Downloading the data this way is really slow, and I was wondering if there is any way to loop through the pages asynchronously. I have been recommended to take a look at Twisted, but I am not entirely sure how to proceed.
Upvotes: 2
Views: 3715
Reputation: 3742
The problem here is that you do not know up front what to read next until you call the API. So think about it this way: what can you do in parallel?
I do not know how much of this you can do in parallel, or which parts, but let's try...
Some assumptions:
- you can retrieve data from the API without penalties or rate limits
- the processing of one page/batch can be done independently of the others
What is slow is the I/O, so you can immediately split your code into two tasks running in parallel: one that reads data, puts it into a queue, and keeps reading until it hits the limit or an empty response (pausing while the queue is full);
and a second task that takes data from the queue and does something with it.
That way you can call one task from the other (see the sketch right after this).
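A minimal sketch of that queue-based split, using plain threads and the standard library Queue; start_url, the queue size and the processing body are placeholders, not something from the question:

import threading
import Queue  # the module is called `queue` in Python 3

import requests

start_url = 'http://example.com/api/documents'  # placeholder
page_queue = Queue.Queue(maxsize=10)            # reader pauses while the queue is full


def read_pages(api_url):
    # producer: follow the pagination links and push each page onto the queue
    while api_url:
        data = requests.get(api_url).json()
        page_queue.put(data)            # blocks if the queue is full
        api_url = data.get('more_url')  # missing 'more_url' ends the loop
    page_queue.put(None)                # sentinel: no more pages


def process_pages():
    # consumer: take pages off the queue and do something with them
    while True:
        data = page_queue.get()
        if data is None:
            break
        for item in data['documents']:
            pass  # DO SOMETHING HERE


reader = threading.Thread(target=read_pages, args=(start_url,))
worker = threading.Thread(target=process_pages)
reader.start()
worker.start()
reader.join()
worker.join()

Note that this only overlaps the downloads with the processing inside one process; it does not give you more CPU for the processing itself.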
The other approach is to have one task call the other immediately after the data is read, so that their execution runs in parallel, just slightly shifted.
How would I implement it? As Celery tasks, and yes, with requests. For example, the second approach:
import requests
from celery import task  # with a newer Celery, use app.task / shared_task instead


@task
def do_data_process(data):
    # do something with the data
    pass


@task
def parse_one_page(url):
    response = requests.get(url)
    data = response.json()
    if 'more_url' in data:
        # schedule the next page before processing this one
        parse_one_page.delay(data['more_url'])
    # and here do the data processing in this task
    do_data_process(data)
    # or hand it to a worker and do it in another process
    # do_data_process.delay(data)
It is up to you how many tasks run in parallel if you add limits to your code; you can even have workers on multiple machines and separate queues for parse_one_page and do_data_process (a sketch of how to start this follows).
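For illustration, assuming the two tasks above live in a module called tasks.py and a broker is already configured (the module name, queue names and URL are placeholders, not something from the question):

# in your Celery configuration: route each task to its own queue,
# so dedicated workers (possibly on different machines) can consume them
CELERY_ROUTES = {
    'tasks.parse_one_page': {'queue': 'fetch'},
    'tasks.do_data_process': {'queue': 'process'},
}

# kick off the chain once; the tasks schedule the remaining pages themselves
from tasks import parse_one_page
parse_one_page.delay('http://example.com/api/documents')

Routing do_data_process to its own queue only matters if you call it with .delay(); a direct call runs inside the same worker process as parse_one_page.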
Why this approach, and not Twisted or async I/O?
Because you have CPU-bound data processing (parsing the JSON, then working on the data), and for that it is better to have separate processes, which Celery handles perfectly.
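A sketch of running the workers, assuming the tasks.py module and the queue names from the sketch above; the concurrency values are arbitrary, and each worker is a pool of separate prefork processes:

celery -A tasks worker -Q fetch --concurrency=2
celery -A tasks worker -Q process --concurrency=4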
Upvotes: 2