Reputation: 1123

Python multithreading without a queue working with large data sets

I am running through a CSV file of about 800k rows. I need a threading solution that reads each row and hands it to a worker, with 32 worker threads running at a time. I want to do this without a queue, because the current Python threading solution with a queue appears to be eating up a lot of memory.

Basically, I want to read a row from the CSV file and hand it to a worker thread, with only 32 threads running at a time.

This is current script. It appears that it is reading the entire csv file into queue and doing a queue.join(). Is it correct that it is loading the entire csv into a queue then spawning the threads?

queue=Queue.Queue()

def worker():
    while True:
        task=queue.get()
        try:
            subprocess.call(['php {docRoot}/cli.php -u "api/email/ses" -r "{task}"'.format(
                docRoot=docRoot,
                task=task
            )],shell=True)
        except:
            pass
        with lock:
            stats['done']+=1
            if int(time.time())!=stats.get('now'):
                stats.update(
                    now=int(time.time()),
                    percent=(stats.get('done')/stats.get('total'))*100,
                    ps=(stats.get('done')/(time.time()-stats.get('start')))
                )
                print("\r    {percent:.1f}% [{progress:24}] {persec:.3f}/s ({done}/{total}) ETA {eta:<12}".format(
                    percent=stats.get('percent'),
                    progress=('='*int((23*stats.get('percent'))/100))+'>',
                    persec=stats.get('ps'),
                    done=int(stats.get('done')),
                    total=stats.get('total'),
                    eta=snippets.duration.time(int((stats.get('total')-stats.get('done'))/stats.get('ps')))
                ),end='')
        queue.task_done()


    for i in range(32):
        workers=threading.Thread(target=worker)
        workers.daemon=True
        workers.start()
    try:
        with open(csvFile,'rb') as fh:
            try:
                dialect=csv.Sniffer().sniff(fh.readline(),[',',';'])
                fh.seek(0)
                reader=csv.reader(fh,dialect)
                headers=reader.next()
            except csv.Error as e:
                print("\rERROR[CSV] {error}\n".format(error=e))
            else:
                while True:
                    try:
                        data=reader.next()
                    except csv.Error as e:
                        print("\rERROR[CSV] - Line {line}: {error}\n".format(line=reader.line_num, error=e))
                    except StopIteration:
                        break
                    else:
                        stats['total']+=1
                        queue.put(urllib.urlencode(dict(zip(headers,data)+dict(campaign=row.get('Campaign')).items())))
        queue.join()

Upvotes: 0

Answers (4)

Tim Peters

Reputation: 70602

Your question is pretty unclear. Have you tried initializing your Queue to have a maximum size of, say, 64?

myq = Queue.Queue(maxsize=64)

Then a producer (one or more) trying to .put() new items on myq will block until consumers reduce the queue size to less than 64. This will correspondingly limit the amount of memory consumed by the queue. By default, queues are unbounded: if the producer(s) add items faster than consumers take them off, the queue can grow to consume all the RAM you have.
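A minimal sketch of that idea, written for Python 3 (where the module is named queue rather than Queue). The names run_bounded, handle and the stop sentinel are illustrative assumptions, not part of the script in the question:

```python
import queue
import threading

def run_bounded(items, handle, num_workers=4, maxsize=8):
    """Feed items to worker threads through a bounded queue.

    Returns (results, peak), where peak is the largest queue size the
    producer observed right after a put().
    """
    q = queue.Queue(maxsize=maxsize)   # put() blocks once maxsize items wait
    results = []
    lock = threading.Lock()
    stop = object()                    # sentinel telling a worker to exit

    def worker():
        while True:
            item = q.get()
            try:
                if item is stop:
                    return
                value = handle(item)
                with lock:
                    results.append(value)
            finally:
                q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    peak = 0
    for item in items:                 # items may be lazy, e.g. a csv.reader
        q.put(item)                    # blocks while the queue is full
        peak = max(peak, q.qsize())
    for _ in threads:
        q.put(stop)                    # one sentinel per worker
    q.join()
    for t in threads:
        t.join()
    return results, peak
```

Because the producer blocks in put(), at most maxsize rows are waiting in memory at any moment, no matter how large the input is. The sketch assumes handle does not raise; a real worker would catch and log errors.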

EDIT

This is current script. It appears that it is reading the entire csv file into queue and doing a queue.join(). Is it correct that it is loading the entire csv into a queue then spawning the threads?

The indentation is messed up in your post, so have to guess some, but:

The code obviously starts 32 threads before it opens the CSV file.
You didn't show the code that creates the queue. As already explained above, if it's a Queue.Queue, by default it's unbounded, and can grow to any size if your main loop puts items on it faster than your threads remove items from it. Since you haven't said anything about what worker() does (or shown its code), we don't have enough information to guess whether that's the case. But that memory use is out of hand suggests that's the case.
And, as also explained, you can stop that easily by specifying a maximum size when you create the queue.

To get better answers, supply better info ;-)

ANOTHER EDIT

Well, the indentation is still messed up in spots, but it's better. Have you tried any suggestions? Looks like your worker threads each spawn a new process, so they'll take very much longer than it takes just to read another line from the csv file. So it's indeed very likely that you put items on the queue far faster than they're taken off. So, for the umpteenth time ;-), TRY initializing the queue with (say) maxsize=64. Then reveal what happens.

BTW, the bare except: clause in worker() is a Really Bad Idea. If anything goes wrong, you'll never know. If you have to ignore every possible exception (including even KeyboardInterrupt and SystemExit), at least log the exception info.
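As a rough sketch of what "at least log it" could look like (run_task and the logger name are made up for illustration; the command passed in would be whatever the worker launches):

```python
import logging
import subprocess

log = logging.getLogger("worker")

def run_task(cmd):
    """Run one command; return its exit code, or None if it could not be started."""
    try:
        return subprocess.call(cmd)
    except Exception:
        # Unlike a bare `except:`, this lets KeyboardInterrupt and
        # SystemExit propagate, and the traceback is logged, not lost.
        log.exception("task failed: %r", cmd)
        return None
```

Here cmd is passed as an argument list rather than a shell string, which is a separate design choice from the one in the question's script.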

And note what @JamesAnderson said: unless you have extraordinary hardware resources, trying to run 32 processes at a time is almost certainly slower than running a number of processes that's no more than twice the number of available cores. Then again, that also depends a lot on what your PHP program does. If, for example, the PHP program uses disk I/O heavily, any multiprocessing may be slower than none.

Upvotes: 0

Michael

Reputation: 13914

Other answers have explained how to use Pool without having to manage queues (it manages them for you) and that you do not want to set the number of processes to 32, but to your CPU count - 1. I would add two things. First, you may want to look at the pandas package, which can easily import your csv file into Python. The second is that the examples of using Pool in the other answers only pass it a function that takes a single argument. Unfortunately, you can only pass Pool a single object with all the inputs for your function, which makes it difficult to use functions that take multiple arguments. Here is code that allows you to call a previously defined function with multiple arguments using pool:

import multiprocessing
from multiprocessing import Pool

def multiplyxy(x, y):
    return x * y

def funkytuple(t):
    """
    Breaks a tuple into a function to be called and a tuple
    of arguments for that function, then calls the function
    with those arguments unpacked.
    """
    f, args = t
    return f(*args)

def processparallel(func, arglist):
    """
    Takes a function and a list of argument sequences for that
    function and processes them in parallel.
    """
    parallelarglist = [(func, tuple(entry)) for entry in arglist]

    # Leave one core free, but always use at least one process.
    cpu_count = max(1, multiprocessing.cpu_count() - 1)

    pool = Pool(processes=cpu_count)
    database = pool.map(funkytuple, parallelarglist)
    pool.close()
    pool.join()
    return database

# Necessary on Windows
if __name__ == '__main__':
    x = [23, 23, 42, 3254, 32]
    y = [324, 234, 12, 425, 13]

    arglist = [[xi, yi] for xi, yi in zip(x, y)]

    database = processparallel(multiplyxy, arglist)
    print(database)
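On Python 3.3 and later, the pool classes also have a starmap method that does this argument unpacking itself. A small sketch, using multiprocessing.pool.ThreadPool so it runs anywhere without pickling concerns (multiprocessing.Pool exposes the same starmap method); multiply_pairs is a name made up for this example:

```python
from multiprocessing.pool import ThreadPool

def multiply_xy(x, y):
    return x * y

def multiply_pairs(pairs, workers=2):
    """Apply multiply_xy to each (x, y) pair, preserving input order."""
    with ThreadPool(workers) as pool:
        # starmap unpacks each pair into positional arguments.
        return pool.starmap(multiply_xy, pairs)
```

Whether threads or processes are the better fit depends on the work being done; for CPU-bound pure-Python functions, processes are usually the ones that scale.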

Upvotes: 0

James Mills

Reputation: 19030

I don't understand why you want to spawn 32 threads per row. However, processing data in parallel is a fairly common, embarrassingly parallel thing to do and is easily achievable with Python's multiprocessing library.

Example:

from multiprocessing import Pool

def job(args):
    pass  # do some work

inputs = [...]  # define your inputs
Pool().map(job, inputs)

I leave it up to you to fill in the blanks to meet your specific requirements.
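For what it's worth, here is one way the blanks might be filled in for CSV input. This is only a sketch under assumptions: the column names and the body of job are invented, and it uses multiprocessing.pool.ThreadPool (same map/imap API as Pool) since the real work in the question happens in an external process anyway:

```python
import csv
import io
from multiprocessing.pool import ThreadPool

def job(row):
    """Stand-in for the real per-row work (e.g. launching the PHP script)."""
    return "{}:{}".format(row["email"], row["name"].upper())

def process_csv(fh, workers=4):
    reader = csv.DictReader(fh)          # yields one dict per data row
    with ThreadPool(workers) as pool:
        # imap_unordered yields results as they finish, in no fixed order.
        return list(pool.imap_unordered(job, reader))

# Tiny in-memory file standing in for the real CSV.
sample = io.StringIO("email,name\na@x.org,ann\nb@x.org,bob\n")
rows = sorted(process_csv(sample))
```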

See: https://bitbucket.org/ccaih/ccav/src/tip/bin/ for many examples of this pattern.

Upvotes: 1

James Anderson

Reputation: 27478

32 threads is probably overkill unless you have some humungous hardware available.

The rule of thumb for optimum number of threads or processes is: (no. of cores * 2) - 1 which comes to either 7 or 15 on most hardware.
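Expressed as code, that rule of thumb might look like the following (suggested_workers is just an illustrative name, and the floor of 1 is my own addition):

```python
import os

def suggested_workers(cores=None):
    """Rule of thumb from the text: (cores * 2) - 1, never below 1."""
    if cores is None:
        cores = os.cpu_count() or 1   # cpu_count() may return None
    return max(1, cores * 2 - 1)
```

With 4 cores this gives 7, and with 8 cores it gives 15.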

The simplest way would be to start 7 threads, passing each thread an "offset" as a parameter, i.e. a number from 0 to 6.

Each thread would then skip rows until it reached the "offset" number and process that row. Having processed the row it can skip 6 rows and process the 7th -- repeat until no more rows.

This setup works for threads and multiple processes and is very efficient in I/O on most machines as all the threads should be reading roughly the same part of the file at any given time.
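A rough sketch of the offset scheme, assuming no header row and using in-memory text in place of the real file (in practice each worker would open the file itself). The names rows_for and run_strided are invented for this example:

```python
import csv
import io
import itertools
import threading

def rows_for(offset, stride, lines):
    """Yield parsed rows offset, offset+stride, offset+2*stride, ..."""
    reader = csv.reader(lines)
    return itertools.islice(reader, offset, None, stride)

def run_strided(text, stride, handle):
    """Start `stride` threads; each parses its own copy of the input."""
    def worker(offset):
        for row in rows_for(offset, stride, io.StringIO(text)):
            handle(offset, row)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(stride)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Note that every worker still reads and parses the whole input and simply discards the rows that belong to the others, so the scheme trades repeated reading for not needing any shared state.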

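I should add that this method suits Python well: each worker is more or less independent once started, so there is no shared queue or lock to contend on. Run as separate processes, the workers also avoid Python's dreaded global interpreter lock (GIL); threads within one process still share it.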

Upvotes: 2
