Geesh_SO

Reputation: 2205

Python Multithreading Not Functioning

Excuse the unhelpful variable names and unnecessarily bloated code, but I just quickly whipped this together and haven't had time to optimise or tidy up yet.

I wrote this program to dump all the images my friend and I had sent to each other using a webcam photo sharing service (321cheese.com) by parsing a message log for the URLs. The problem is that my multithreading doesn't seem to work.

At the bottom of my code, you'll see my commented-out non-multithreaded download method, which consistently produces the correct result (121 photos in this case). But when I hand each download off to a new thread, the program grabs 112 photos on one run, 90 on another, 115 on another, and so on, but never the full set.

Why would this create a problem? Should I limit the number of simultaneous threads (and how)?

import urllib
import thread

def getName(input):
    l = input.split(".com/")
    m = l[1]
    return m

def parseMessages():
    theFile = open('messages.html', 'r')
    theLines = theFile.readlines()
    theFile.close()
    theNewFile = open('new321.txt','w')
    for z in theLines:
        if "321cheese" in z:
            theNewFile.write(z)
    theNewFile.close()

def downloadImage(inputURL):
    urllib.urlretrieve (inputURL, "./grabNew/" + d)

parseMessages()

f = open('new321.txt', 'r')
lines = f.readlines()
f.close()

g = open('output.txt', 'w')

for x in lines:
  a = x.split("<a href=\"")
  b = a[1].split("\"")
  c = b[0]
  if ".png" in c:
    d = getName(c)
    g.write(c+"\n")
    thread.start_new_thread( downloadImage, (c,) )
    ##downloadImage(c)

g.close()

Upvotes: 0

Views: 128

Answers (2)

jfs

Reputation: 414675

There are multiple issues in your code.

The main issue is the use of the global name d from multiple threads. The main loop reassigns d on every iteration while worker threads are still reading it, so several downloads end up reusing the same filename and overwrite one another, which is why the photo count comes up short and varies from run to run. To fix it, pass the name explicitly as an argument to downloadImage().
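
For example, a minimal sketch of that fix, using the higher-level threading module so the main program can also wait for every download to finish (here urls stands in for the list of .png links your loop parses out of the log):

import threading
import urllib

def downloadImage(inputURL, filename):
    # filename arrives as an argument, so each thread uses the name
    # computed for its own URL instead of reading a shared global
    urllib.urlretrieve(inputURL, "./grabNew/" + filename)

threads = []
for url in urls:  # urls: the .png links parsed from the message log
    t = threading.Thread(target=downloadImage, args=(url, getName(url)))
    t.start()
    threads.append(t)

for t in threads:
    t.join()  # wait, so the program doesn't exit mid-download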

The easiest way (code-wise) to limit the number of concurrent downloads is to use concurrent.futures (available on Python 2 as the futures backport) or multiprocessing.Pool:

#!/usr/bin/env python
import os
import urllib
from multiprocessing import Pool
from posixpath import basename
from urllib import unquote
from urlparse import urlsplit


download_dir = "grabNew"

def url2filename(url):
    # strip the URL down to its final path component, undoing percent-encoding
    return basename(unquote(urlsplit(url).path).decode('utf-8'))

def download_image(url):
    filename = None
    try:
        filename = os.path.join(download_dir, url2filename(url))
        return urllib.urlretrieve(url, filename), None
    except Exception as e:
        return (filename, None), e

def main():
    pool = Pool(processes=10)  # at most 10 downloads in flight at once
    # get_urls() is expected to yield the image URLs parsed from the message log
    for (filename, headers), error in pool.imap_unordered(download_image, get_urls()):
        pass # do something with the downloaded file or handle an error

if __name__ == "__main__":
    main()
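
For comparison, here is a sketch of the same pipeline using concurrent.futures (the futures backport on Python 2; standard library on Python 3), reusing download_image() and get_urls() from above. Note that executor.map() yields results in input order rather than completion order:

from concurrent.futures import ThreadPoolExecutor

def main():
    with ThreadPoolExecutor(max_workers=10) as executor:
        for (filename, headers), error in executor.map(download_image, get_urls()):
            pass  # do something with the downloaded file or handle an error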

Upvotes: 1

LtWorf

Reputation: 7608

Did you make sure your parsing is working correctly?

Also, you are launching too many threads at once: one per image, with no upper bound.

And finally... threads in Python are FAKE: the GIL keeps them from running Python code in parallel. Use the multiprocessing module if you want real parallelism. But since the images are probably all on the same server, opening a hundred connections to it at the same time will likely make its firewall start dropping your connections.
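
One way to put a ceiling on simultaneous connections is a semaphore. A minimal sketch, assuming downloadImage() has been fixed to take the filename as an argument (as the other answer suggests), with an arbitrary cap of 5:

import threading

MAX_CONCURRENT = 5  # arbitrary; tune to what the server tolerates
gate = threading.BoundedSemaphore(MAX_CONCURRENT)

def throttledDownload(url, filename):
    # blocks while MAX_CONCURRENT downloads are already in flight
    with gate:
        downloadImage(url, filename)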

Upvotes: 1
