Reputation: 251
I'm currently testing something with threading and workerpool: I create 400 threads which download a total of 5000 URLs. The problem is that some of the 400 threads are "freezing": looking at my processes, I see that roughly 15 threads freeze in every run, and after some time they eventually close one by one.
My question is whether there is a way to have some sort of 'timer' / 'counter' that kills a thread if it isn't finished after x seconds.
# download2.py - Download many URLs using multiple threads.
import os
import urllib2
import workerpool
import datetime
from threading import Timer

class DownloadJob(workerpool.Job):
    "Job for downloading a given URL."
    def __init__(self, url):
        self.url = url # The url we'll need to download when the job runs
    def run(self):
        try:
            url = urllib2.urlopen(self.url).read()
        except:
            pass

# Initialize a pool, 400 threads in this case
pool = workerpool.WorkerPool(size=400)
# Loop over urls.txt and create a job to download the URL on each line
print datetime.datetime.now()
for url in open("urls.txt"):
    job = DownloadJob(url.strip())
    pool.put(job)
# Send shutdown jobs to all threads, and wait until all the jobs have been completed
pool.shutdown()
pool.wait()
print datetime.datetime.now()
Upvotes: 0
Views: 2200
Reputation: 41950
The problem is that some of the 400 threads are "freezing"...
That's most likely because of this line...
url = urllib2.urlopen(self.url).read()
By default, Python will wait forever for a remote server to respond, so if one of your URLs points to a server which is ignoring the SYN packet, or is otherwise just really slow, the thread could potentially be blocked forever.
You can use the timeout parameter of urlopen() to set a limit on how long the thread will wait for the remote host to respond...
url = urllib2.urlopen(self.url, timeout=5).read() # Time out after 5 seconds
...or you can set it globally instead with socket.setdefaulttimeout() by putting these lines at the top of your code...
import socket
socket.setdefaulttimeout(5) # Time out after 5 seconds
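Once a timeout is in place, the worker also has to cope with the exception it produces. Here's a minimal sketch of that, not a drop-in replacement for your job class. The fetch() helper and its opener parameter are names I made up for illustration (opener is only there so the function can be exercised without a network), and the try/except import is just so the snippet also runs on Python 3, where urllib2 became urllib.request. As far as I know, a timeout can surface either as URLError or as a bare socket.timeout depending on whether it happens while connecting or while reading, so the sketch catches both.

```python
import socket

try:
    from urllib2 import urlopen, URLError  # Python 2, as in the question
except ImportError:
    from urllib.request import urlopen  # Python 3 equivalent
    from urllib.error import URLError


def fetch(url, timeout=5, opener=urlopen):
    """Return the body of url, or None if the request fails or times out.

    'fetch' and 'opener' are hypothetical names used for this sketch only.
    """
    try:
        # The timeout is forwarded to the socket layer, so a server that
        # never answers makes this raise instead of blocking the thread.
        return opener(url, timeout=timeout).read()
    except (URLError, socket.timeout):
        # Connection refused, DNS failure, timeout, and so on.
        return None
```

Inside DownloadJob.run() you could then call fetch(self.url) and treat None as a failed download, rather than swallowing every exception with a bare except.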
Upvotes: 1
Reputation: 6520
urlopen accepts a timeout value; I think that would be the best way to handle it.
But I agree with the commenter that 400 threads is probably way too many.
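For what it's worth, here is a rough sketch of the same job with a much smaller pool. It swaps workerpool for concurrent.futures, which is in the standard library on Python 3 (on Python 2 you'd need the futures backport from PyPI, I believe). The download_all() name and the default of 20 workers are my own assumptions, not a measured optimum, and fetch is whatever function you use to download one URL, ideally one that passes a timeout to urlopen.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, fetch, max_workers=20):
    """Run fetch(url) for every URL on at most max_workers threads.

    Returns a dict mapping each URL to whatever fetch returned.
    'download_all' and the default of 20 are illustrative assumptions.
    """
    urls = list(urls)  # we iterate twice, so materialize generators/files
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the input URLs.
        results = pool.map(fetch, urls)
        return dict(zip(urls, results))
```

Usage would be something like download_all((line.strip() for line in open("urls.txt")), my_fetch). Note that if fetch raises, the exception comes back out of download_all, so it's easier if fetch handles its own errors.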
Upvotes: 0