Coryza

Reputation: 251

Python Threads not finishing

I'm currently testing something with threading and the workerpool module: I create 400 threads which download a total of 5000 URLs. The problem is that some of the 400 threads are "freezing": when I look at my processes, roughly 15 threads freeze in every run and eventually close one by one after a while.

My question is whether there is a way to have some sort of timer or counter that kills a thread if it hasn't finished after x seconds.

# download2.py - Download many URLs using multiple threads.
import os
import urllib2
import workerpool
import datetime
from threading import Timer

class DownloadJob(workerpool.Job):
    "Job for downloading a given URL."
    def __init__(self, url):
        self.url = url # The url we'll need to download when the job runs
    def run(self):
        try:
            url = urllib2.urlopen(self.url).read()
        except:
            pass

# Initialize a pool, 400 threads in this case
pool = workerpool.WorkerPool(size=400)

# Loop over urls.txt and create a job to download the URL on each line
print datetime.datetime.now()
for url in open("urls.txt"):
    job = DownloadJob(url.strip())
    pool.put(job)

# Send shutdown jobs to all threads, and wait until all the jobs have been completed
pool.shutdown()
pool.wait()
print datetime.datetime.now()

Upvotes: 0

Views: 2200

Answers (2)

Aya

Reputation: 41950

The problem is that some of the 400 threads are "freezing"...

That's most likely because of this line...

url = urllib2.urlopen(self.url).read()

By default, Python sets no timeout of its own and will wait indefinitely for a remote server to respond, so if one of your URLs points to a server which is ignoring the SYN packet, or is otherwise just really slow, the thread can stay blocked for a very long time.

You can use the timeout parameter of urlopen() to set a limit on how long the thread will wait for the remote host to respond...

url = urllib2.urlopen(self.url, timeout=5).read() # Time out after 5 seconds

...or you can set it globally instead with socket.setdefaulttimeout() by putting these lines at the top of your code...

import socket
socket.setdefaulttimeout(5) # Time out after 5 seconds
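
A timeout only helps if the resulting error is handled, so here is a minimal sketch of what the job body could look like. It assumes Python 3, where urllib.request is the counterpart of urllib2; the fetch function and its opener parameter are hypothetical names, added only so the behaviour can be tried without a network connection.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=5, opener=urllib.request.urlopen):
    """Return the response body, or None if the request fails or times out.

    `fetch` and `opener` are hypothetical names used for this sketch only.
    """
    try:
        # The timeout applies to each blocking socket operation (the
        # connect, each read), not to the download as a whole.
        response = opener(url, timeout=timeout)
        try:
            return response.read()
        finally:
            response.close()
    except (urllib.error.URLError, socket.timeout) as exc:
        # A connect timeout usually arrives wrapped in URLError, while a
        # timeout during read() surfaces as socket.timeout.
        print("skipping %s: %s" % (url, exc))
        return None
```

Catching only these two exception types, rather than using a bare except, keeps genuine bugs visible while still letting a slow host be skipped.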

Upvotes: 1

bwbrowning

Reputation: 6520

urlopen accepts a timeout value; I think that would be the best way to handle it.

But I agree with the commenter that 400 threads is probably way too many.
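
As a rough illustration of a smaller pool, here is a sketch that uses concurrent.futures from the Python 3 standard library instead of workerpool. The name download_all, the fetch callable passed into it, and the default of 32 workers are all assumptions made for this example, not recommendations measured against the original workload.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, max_workers=32):
    """Run fetch(url) for every URL on a bounded pool; return {url: result}.

    `download_all` and `fetch` are hypothetical names; 32 is an arbitrary
    starting point, not a tuned value.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so results can be labelled.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception:
                # A failed download is recorded as None and does not stop
                # the remaining jobs.
                results[url] = None
    return results
```

The pool cannot kill a worker that is stuck, so the callable passed as fetch should still use a socket timeout; the pool only bounds how many downloads run at once.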

Upvotes: 0
