Reputation: 1236
I've been playing with tornado, twisted, gevent, and grequests in order to get the best performance when fetching 50k URLs.
The process I want to create:
I'm going to process millions of URLs a day. I started implementing this but ran into a few problems.
Firstly, populating a queue with the results I get from the async crawler consumes too much memory, and I need to address that - what would be a good practice? Secondly, I'm having a hard time synchronizing the threading code with the gevent crawler: how do I download asynchronously and process the responses while populating the queue with results?
In other words, how do I synchronize the async crawler with the threading code that processes its responses?
Thanks!
Upvotes: 0
Views: 692
Reputation: 414585
gevent, twisted, asyncio should handle 50k urls just fine.
To avoid consuming too much memory and to synchronize the parts of the program that download and process URLs, you could set a maximum size on the corresponding queues: if downloading happens too fast, queue.put() will block once the queue reaches its maximum capacity.
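Here is a minimal sketch of that pattern with gevent (not the poster's code; the URL list, worker counts, and queue size are placeholder assumptions): downloads feed a bounded gevent.queue.Queue, so put() blocks whenever the processors fall behind, which caps memory use.

```python
# Minimal sketch, not the poster's code: a bounded queue throttles the
# downloaders, so results never pile up in memory faster than they are
# processed. URLs, pattern, and worker counts are placeholder assumptions.
from gevent import monkey
monkey.patch_all()  # make socket I/O cooperative before other imports

import gevent
from gevent.pool import Pool
from gevent.queue import Queue
import urllib.request

results = Queue(maxsize=100)   # bounded: put() blocks when it is full

def download(url):
    body = urllib.request.urlopen(url, timeout=10).read()
    results.put((url, body))   # blocks here if processors lag behind

def process():
    while True:
        item = results.get()
        if item is None:       # sentinel: no more work
            return
        url, body = item
        # ... regex / parse / store the response here ...

urls = ["http://example.com/page/%d" % i for i in range(1000)]
consumers = [gevent.spawn(process) for _ in range(4)]

pool = Pool(50)                # also limit concurrent downloads
pool.map(download, urls)

for _ in consumers:
    results.put(None)          # tell each consumer to stop
gevent.joinall(consumers)
```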
Green threads would be useless for parallel regex processing. Ordinary Python threads, which use real OS threads, would also be useless here if the GIL is not released during the regex processing: the re module does not release the GIL, while the regex module can release it in some cases.
If you use the re module, you might want to create a pool of processes instead of threads, with the number of processes roughly equal to the number of CPUs on the host.
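As a rough illustration of that approach (the pattern and sample response bodies are made up for the example), the CPU-bound regex work can be handed to a multiprocessing.Pool sized to the machine's CPU count:

```python
# Minimal sketch: do the regex extraction in worker processes, since re
# holds the GIL and threads would not run the matching in parallel.
# The pattern and the sample bodies below are purely illustrative.
import re
import multiprocessing

LINK_RE = re.compile(rb'href="([^"]+)"')

def extract_links(body):
    return LINK_RE.findall(body)

if __name__ == "__main__":
    bodies = [
        b'<a href="http://example.com/a">a</a>',
        b'<a href="http://example.com/b">b</a>',
    ]
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        for links in pool.imap_unordered(extract_links, bodies):
            print(links)
```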
Beware of how you use MySQL and Redis (you might need a green driver for some usage scenarios).
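For instance (assuming the pure-Python redis-py client; the host, port, and key are placeholders), gevent's monkey patching is usually enough to make socket-based pure-Python clients cooperative, whereas a C-based MySQL driver would still block the event loop, so a pure-Python driver such as PyMySQL tends to be the safer choice under gevent.

```python
# Minimal sketch: patch sockets before importing the client so its network
# I/O yields to other greenlets. Assumes the pure-Python redis-py client;
# host, port, and key names are placeholders.
from gevent import monkey
monkey.patch_all()

import redis  # now uses gevent-patched sockets

r = redis.Redis(host="localhost", port=6379)
r.set("crawled:http://example.com/", 1)
print(r.get("crawled:http://example.com/"))
```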
Upvotes: 1