YSY

Reputation: 1236

How to crawl and process (CPU-intensive) thousands of URLs using gevent and threading?

I've been playing with tornado, twisted, gevent and grequests in order to get the best performance when fetching 50k URLs.

The process I want to create:

I'm going to process millions of URLs a day. I started implementing this but ran into a few problems:

Firstly, populating a queue with the results I get from the async crawler consumes too much memory; I need to address this, so what would be a good practice? Secondly, I'm having a hard time synchronizing the threading code with the gevent crawler: how do I download asynchronously and process the responses while populating the queue with results?

Or, how do I synchronize the async crawler with the threading code that processes the responses from the async crawler?

Thanks!

Upvotes: 0

Views: 692

Answers (1)

jfs

Reputation: 414585

gevent, twisted, or asyncio should handle 50k URLs just fine.

To avoid consuming too much memory and to synchronize the processes that download and process urls, you could set a maximum size on the corresponding queues: if the downloading happens too fast, it will block on queue.put() when the queue reaches its maximum capacity.
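For example, a minimal sketch of that backpressure pattern with gevent and a bounded queue (the URL list, queue size and pool size below are placeholders, not values from the question):

    import gevent.monkey
    gevent.monkey.patch_all()  # make standard-library sockets cooperative

    import gevent
    import requests
    from gevent.pool import Pool
    from gevent.queue import JoinableQueue

    URLS = ["http://example.com/page/%d" % i for i in range(1000)]  # placeholder URLs

    results = JoinableQueue(maxsize=100)  # put() blocks once 100 responses are waiting
    pool = Pool(50)                       # at most 50 downloads in flight

    def fetch(url):
        resp = requests.get(url, timeout=10)
        results.put((url, resp.text))     # blocks here if the consumer falls behind

    def consume():
        while True:
            url, body = results.get()
            try:
                # hand the body off to CPU-bound workers here (see the process pool below)
                print(url, len(body))
            finally:
                results.task_done()

    gevent.spawn(consume)
    pool.map(fetch, URLS)
    results.join()  # wait until every downloaded page has been processed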

Green threads would be useless for parallel regex processing. Ordinary Python threads that use real OS threads are also useless here if the GIL is not released during the regex processing: the re module does not release the GIL, while the regex module can release it in some cases.

If you use the re module, you might want to create a pool of processes instead of threads, with the number of processes roughly equal to the number of CPUs on the host.
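For instance, a minimal sketch of that process pool approach using the standard multiprocessing module (the regex pattern and the input pages are placeholders, not taken from the question):

    import re
    import multiprocessing

    PATTERN = re.compile(r'<title>(.*?)</title>', re.S)  # placeholder pattern

    def extract_title(page):
        # CPU-bound work: runs in a worker process, so the GIL is not shared
        url, body = page
        match = PATTERN.search(body)
        return url, match.group(1) if match else None

    if __name__ == '__main__':
        pages = [('http://example.com/1', '<html><title>hi</title></html>')]  # placeholder input
        pool = multiprocessing.Pool()  # defaults to the number of CPUs on the host
        for url, title in pool.imap_unordered(extract_title, pages):
            print(url, title)
        pool.close()
        pool.join()

In practice the downloader would feed pages into the pool instead of a hardcoded list; only the CPU-bound side is sketched here.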

Beware of how you use MySQL and Redis (you might need a green driver for some usage scenarios).
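To illustrate the last point: a pure-Python client such as redis-py should become cooperative once gevent's monkey patching is applied, whereas a blocking C driver would stall the whole event loop. A minimal sketch, assuming redis-py and a local Redis instance (connection details and key scheme are placeholders):

    import gevent.monkey
    gevent.monkey.patch_all()  # patch sockets before importing the client

    import redis  # redis-py does its socket I/O in Python, so it becomes cooperative

    r = redis.StrictRedis(host='localhost', port=6379, db=0)  # placeholder connection

    def store(url, title):
        # placeholder key scheme for illustration
        r.set('title:%s' % url, title or '')

    store('http://example.com/1', 'hi')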

Upvotes: 1
