Reputation: 2205
Excuse the unhelpful variable names and unnecessarily bloated code, but I just quickly whipped this together and haven't had time to optimise or tidy up yet.
I wrote this program to dump all the images my friend and I had sent to each other using a webcam photo sharing service ( 321cheese.com ) by parsing a message log for the URLs. The problem is that my multithreading doesn't seem to work.
At the bottom of my code, you'll see my commented-out non-multithreaded download method, which consistently produces the correct results (which is 121 photos in this case). But when I try to send this action to a new thread, the program sometimes downloads 112 photos, sometimes 90, sometimes 115 photos, etc, but never gives out the correct result.
Why would this create a problem? Should I limit the number of simultaneous threads (and how)?
import urllib
import thread
def getName(input):
l = input.split(".com/")
m = l[1]
return m
def parseMessages():
theFile = open('messages.html', 'r')
theLines = theFile.readlines()
theFile.close()
theNewFile = open('new321.txt','w')
for z in theLines:
if "321cheese" in z:
theNewFile.write(z)
theNewFile.close()
def downloadImage(inputURL):
urllib.urlretrieve (inputURL, "./grabNew/" + d)
parseMessages()
f = open('new321.txt', 'r')
lines = f.readlines()
f.close()
g = open('output.txt', 'w')
for x in lines:
a = x.split("<a href=\"")
b = a[1].split("\"")
c = b[0]
if ".png" in c:
d = getName(c)
g.write(c+"\n")
thread.start_new_thread( downloadImage, (c,) )
##downloadImage(c)
g.close()
Upvotes: 0
Views: 128
Reputation: 414675
There are multiple issues in your code.
The main issue is d
global name usage in multiple threads. To fix it, pass the name explicitly as an argument to downloadImage()
.
The easy way (code-wise) to limit the number of concurrent downloads is to use concurrent.futures
(available on Python 2 as futures
) or multiprocessing.Pool
:
#!/usr/bin/env python
import urllib
from multiprocessing import Pool
from posixpath import basename
from urllib import unquote
from urlparse import urlsplit
download_dir = "grabNew"
def url2filename(url):
return basename(unquote(urlsplit(url).path).decode('utf-8'))
def download_image(url):
filename = None
try:
filename = os.path.join(download_dir, url2filename(url))
return urllib.urlretrieve(url, filename), None
except Exception as e:
return (filename, None), e
def main():
pool = Pool(processes=10)
for (filename, headers), error in pool.imap_unordered(download_image, get_urls()):
pass # do something with the downloaded file or handle an error
if __name__ == "__main__":
main()
Upvotes: 1
Reputation: 7608
Did you make sure your parsing is working correctly?
Also, you are launching too many threads.
And finally... threads in python are FAKE! Use the multiprocessing module if you want real parallelism, but since the images are probably all from the same server, if you open one hundred connections at the same time with the same server, probably its firewall will start dropping your connections.
Upvotes: 1