YSY

Reputation: 1236

How to crawl and process (CPU-intensive) thousands of URLs using gevent and threading?

I've been playing with tornado, twisted, gevent and grequests in order to get the best performance when fetching 50k URLs.

The process I want to create:

I'm going to process millions of URLs a day. I started implementing this but ran into a few problems:

Firstly, populating a queue with the results I get from the async crawler consumes too much memory; I need to address this, so what would be a good practice? Secondly, I'm having a hard time synchronizing the threading code with the gevent crawler: how do I download asynchronously and process the responses while populating the queue with results?

Or, how do I synchronize the async crawler with the threading code that processes the responses from the async crawler?

Thanks!

Upvotes: 0

Views: 692

Answers (1)

jfs

Reputation: 414585

gevent, twisted, or asyncio should handle 50k URLs just fine.

To avoid consuming too much memory and to synchronize the processes that download and process urls, you could set a maximum size on the corresponding queues: if the downloading happens too fast, it will block on queue.put() when the queue reaches its maximum capacity.
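For example, a minimal sketch of that backpressure pattern with gevent and a bounded queue (the URL list, queue size and pool size below are placeholders, not values from the question):

    import gevent.monkey
    gevent.monkey.patch_all()  # make standard-library sockets cooperative

    import gevent
    import requests
    from gevent.pool import Pool
    from gevent.queue import JoinableQueue

    URLS = ["http://example.com/page/%d" % i for i in range(1000)]  # placeholder URLs

    results = JoinableQueue(maxsize=100)  # put() blocks once 100 responses are waiting
    pool = Pool(50)                       # at most 50 downloads in flight

    def fetch(url):
        resp = requests.get(url, timeout=10)
        results.put((url, resp.text))     # blocks here if the consumer falls behind

    def consume():
        while True:
            url, body = results.get()
            try:
                # hand the body off to CPU-bound workers here (see the process pool below)
                print(url, len(body))
            finally:
                results.task_done()

    gevent.spawn(consume)
    pool.map(fetch, URLS)
    results.join()  # wait until every downloaded page has been processed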

Green threads would be useless for parallel regex processing. Ordinary Python threads that use real OS threads are also useless here if the GIL is not released during the regex processing: the re module does not release the GIL, while the regex module can release it in some cases.

If you use the re module, you might want to create a pool of processes instead of threads, with the number of processes roughly equal to the number of CPUs on the host.
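For instance, a minimal sketch of that process pool approach using the standard multiprocessing module (the regex pattern and the input pages are placeholders, not taken from the question):

    import re
    import multiprocessing

    PATTERN = re.compile(r'<title>(.*?)</title>', re.S)  # placeholder pattern

    def extract_title(page):
        # CPU-bound work: runs in a worker process, so the GIL is not shared
        url, body = page
        match = PATTERN.search(body)
        return url, match.group(1) if match else None

    if __name__ == '__main__':
        pages = [('http://example.com/1', '<html><title>hi</title></html>')]  # placeholder input
        pool = multiprocessing.Pool()  # defaults to the number of CPUs on the host
        for url, title in pool.imap_unordered(extract_title, pages):
            print(url, title)
        pool.close()
        pool.join()

In practice the downloader would feed pages into the pool instead of a hardcoded list; only the CPU-bound side is sketched here.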

Beware of how you use MySQL and Redis (you might need a green driver for some usage scenarios).
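To illustrate the last point: a pure-Python client such as redis-py should become cooperative once gevent's monkey patching is applied, whereas a blocking C driver would stall the whole event loop. A minimal sketch, assuming redis-py and a local Redis instance (connection details and key scheme are placeholders):

    import gevent.monkey
    gevent.monkey.patch_all()  # patch sockets before importing the client

    import redis  # redis-py does its socket I/O in Python, so it becomes cooperative

    r = redis.StrictRedis(host='localhost', port=6379, db=0)  # placeholder connection

    def store(url, title):
        # placeholder key scheme for illustration
        r.set('title:%s' % url, title or '')

    store('http://example.com/1', 'hi')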

Upvotes: 1
