wasd

Reputation: 1572

Python threading or multiprocessing for web-crawler?

I've made a simple web crawler with Python. So far, all it does is maintain a set of URLs that should be visited and a set of URLs that have already been visited. While parsing a page, it adds all the links on that page to the should-be-visited set and the page's URL to the already-visited set, and it keeps going while the length of should_be_visited is > 0. So far it does everything in one thread.
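For reference, a minimal sketch of the loop described above (fetch_page and extract_links stand in for my actual downloading and parsing code):

```python
# Single-threaded crawl loop as described; fetch_page and extract_links
# are placeholders for the real downloading and parsing code.
should_be_visited = {start_url}
already_visited = set()

while should_be_visited:
    url = should_be_visited.pop()
    already_visited.add(url)
    page = fetch_page(url)                 # download the page
    for link in extract_links(page):       # parse out its links
        if link not in already_visited:
            should_be_visited.add(link)
```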

Now I want to add parallelism to this application, so I need the same kind of sets of links and a few threads/processes, where each one pops a URL from should_be_visited and updates already_visited. I'm really lost on threading vs. multiprocessing: which should I use, and do I need Pools or Queues?

Upvotes: 0

Views: 2067

Answers (2)

BlackBear

Reputation: 22979

Another alternative is asynchronous I/O, which is much better suited to this kind of I/O-bound task (unless processing a page is really expensive). You can try it with asyncio or with Tornado, using its httpclient.
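A minimal sketch of that approach, using asyncio together with Tornado's AsyncHTTPClient (Tornado 5+ runs on the asyncio event loop); extract_links stands in for your own parsing code and error handling is kept to the bare minimum:

```python
# Async crawl loop: fetch up to batch_size pages concurrently per iteration.
import asyncio
from tornado.httpclient import AsyncHTTPClient

async def crawl(start_url, batch_size=10):
    client = AsyncHTTPClient()
    should_be_visited = {start_url}
    already_visited = set()

    while should_be_visited:
        # Take up to batch_size URLs from the frontier and fetch them together.
        batch = [should_be_visited.pop()
                 for _ in range(min(batch_size, len(should_be_visited)))]
        already_visited.update(batch)
        responses = await asyncio.gather(
            *(client.fetch(url, raise_error=False) for url in batch))
        for response in responses:
            if response.body is None:                   # skip failed fetches
                continue
            for link in extract_links(response.body):   # your parsing code
                if link not in already_visited:
                    should_be_visited.add(link)
    return already_visited

asyncio.run(crawl("http://example.com"))
```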

Upvotes: 1

Abhinav Upadhyay

Reputation: 2585

The rule of thumb when deciding whether to use threads in Python is to ask whether the task the threads will be doing is CPU-intensive or I/O-intensive. If the answer is I/O-intensive, then you can go with threads.

Because of the GIL, the Python interpreter runs only one thread at a time. If a thread is doing some I/O, it blocks waiting for the data to become available (from the network connection or the disk, for example), and in the meantime the interpreter context-switches to another thread. On the other hand, if the thread is doing a CPU-intensive task, the other threads have to wait until the interpreter decides to run them.

Web crawling is mostly an I/O-oriented task: you need to make an HTTP connection, send a request, and wait for the response. Yes, after you get the response you need to spend some CPU time parsing it, but besides that it is mostly I/O work. So I believe threads are a suitable choice in this case.
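To answer the "do I need Pools, Queues?" part: a Queue plus a handful of worker threads is a common pattern here. A minimal sketch, assuming the requests library for fetching and with extract_links standing in for your own parsing code:

```python
# Threaded crawl: workers pull URLs from a Queue; a Lock protects the shared
# already_visited set; None sentinels tell the workers to shut down.
import threading
from queue import Queue

import requests  # assumed HTTP library; substitute whatever you already use


def crawl(start_url, num_workers=8):
    to_visit = Queue()
    already_visited = set()
    lock = threading.Lock()          # guards already_visited across workers

    def worker():
        while True:
            url = to_visit.get()
            if url is None:          # sentinel: time to exit
                to_visit.task_done()
                break
            with lock:
                seen = url in already_visited
                already_visited.add(url)
            if not seen:
                try:
                    html = requests.get(url, timeout=10).text  # I/O: other threads run meanwhile
                    for link in extract_links(html):           # your parsing code
                        to_visit.put(link)
                except requests.RequestException:
                    pass             # skip pages that fail to load
            to_visit.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    to_visit.put(start_url)
    to_visit.join()                  # wait until every queued URL is handled
    for _ in threads:
        to_visit.put(None)           # one sentinel per worker
    for t in threads:
        t.join()
    return already_visited
```

You could get a similar effect with concurrent.futures.ThreadPoolExecutor, but the explicit Queue makes it easy for workers to feed newly discovered links back into the frontier.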

(And of course, respect the robots.txt, and don't storm the servers with too many requests :-)

Upvotes: 4
