Martyn

Reputation: 806

Python - Terminate certain threads

I have the following code in which I am scraping multiple websites:

import thread

while len(newData) > 0:
    for i in newData:
        try:
            # the args parameter must be a tuple, so (i,) rather than (i)
            thread.start_new_thread(download, (i,))
        except thread.error:
            pass

However, my problem is that it is scraping each website several times. Inside the download function, once a URL has been downloaded I remove it from newData, so no more threads should be opened for it. How can I kill all threads attempting a certain task once that task has already been done? This is my first attempt at threading and I am not sure if I am doing this the correct way.

Upvotes: 0

Views: 53

Answers (2)

Aaron Digulla

Reputation: 328770

Instead of doing it yourself, create a queue. Put objects in the queue which contain all the data necessary to start the task. Create a pool of workers which wait for elements in the queue. Have them put their results into another (output / result) queue.

When starting, create the data objects which contain the URL, etc. and put them all into the queue.

Then you just need to wait for the results to arrive in the output queue.
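A minimal sketch of that pattern, assuming newData is the list of URLs from the question and download(url) stands in for the actual scraping function (names and the pool size of 4 are illustrative, not from the question):

import threading
from Queue import Queue   # "queue" in Python 3

def download(url):
    # placeholder for the question's download function
    return "scraped " + url

def worker(tasks, results):
    while True:
        url = tasks.get()
        if url is None:          # sentinel: no more work for this worker
            break
        results.put((url, download(url)))

tasks = Queue()
results = Queue()

# start a fixed pool of workers that wait on the task queue
workers = [threading.Thread(target=worker, args=(tasks, results))
           for _ in range(4)]
for w in workers:
    w.start()

# put each URL into the queue exactly once
for url in newData:
    tasks.put(url)

# one sentinel per worker, then wait for them to finish
for _ in workers:
    tasks.put(None)
for w in workers:
    w.join()

# collect whatever landed in the output queue
while not results.empty():
    url, page = results.get()

Because each URL is enqueued exactly once and each worker pulls items off the shared queue, no URL is ever scraped twice and there is nothing to kill.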

Upvotes: 1

Aida Paul

Reputation: 2722

First, you may want to look into http://scrapy.org/, which is a great framework for web scraping.

As you are doing it now, you would need to write a thread manager that holds handles to all of the threads, together with some notation of what each one is working on (such as a checksum of the URL), and once a given checksum is finished, kill the other threads carrying the same checksum.

But keep in mind that it is not a good idea to just kill off threads like that; a much better solution is to implement a queue that makes sure you never parse duplicates, and to create threads only for the URLs in it. There are some nice examples of worker pooling and queues in the official documentation, so have a look; a rough sketch of the de-duplication part follows.
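A short sketch of that idea, assuming newData is the question's list of URLs and may contain repeats (the variable names here are illustrative):

from Queue import Queue   # "queue" in Python 3

tasks = Queue()
seen = set()

# enqueue each URL at most once, so no worker ever duplicates work
for url in newData:
    if url not in seen:
        seen.add(url)
        tasks.put(url)

The worker threads then only ever pull de-duplicated URLs from the queue, so there is never a redundant thread that needs killing.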

Upvotes: 0
