Reputation: 2179
This is my first multiprocessing implementation. I ran my code sequentially, and it took around 30 seconds to process about 20 records. Then I created a dictionary where each key holds a set of records, and tried to apply the function to every key's value with pool.map. Now it takes more than 2 minutes to process, even though I am assigning one core to each process. Could someone help me optimize this?
import itertools
from multiprocessing import Pool

def f(values):
    # all unordered pairs of records for this key
    data1 = itertools.combinations(values, 2)
    tuple_attr = ('Age', 'Workclass', 'Fnlwgt', 'Education', 'Education-num', 'marital-status', 'Occupation', 'Relationship', 'Race', 'Sex', 'Capital-gain', 'Capital-loss', 'Hours-per-week', 'Native country', 'Probability', 'Id')
    # for each pair, the names of the attributes where the two records differ
    new = ((tuple_attr[i] for i, t in enumerate(zip(*pair)) if t[0] != t[1]) for pair in data1)
    skt = set(frozenset(temp) for temp in new)
    # keep only the minimal difference sets
    newset = set(s for s in skt if not any(p < s for p in skt))
    empty = frozenset(" ")
    tr_x = set(frozenset(i) for i in empty)
    tr = set(frozenset(i) for i in empty)
    for e in newset:
        tr.clear()
        tr = tr.union(tr_x)
        tr_x.clear()
        for x in tr:
            for a in e:
                if x == empty:
                    tmp = frozenset(frozenset([a]))
                    tr_x = tr_x.union([tmp])
                else:
                    tmp = frozenset(frozenset([a]).union(x))
                    tr_x = tr_x.union([tmp])
        tr.clear()
        tr = tr.union(tr_x)
        # drop any set that has a proper subset in tr
        tr = set(l for l in tr if not any(m < l for m in tr))
    return tr
def main():
    p = Pool(len(data))  # number of processes = number of CPUs
    keys, values = zip(*data.items())  # ordered keys and values
    processed_values = p.map(f, values)
    result = dict(zip(keys, processed_values))
    p.close()  # no more tasks
    p.join()  # wrap up current tasks
    print(result)

if __name__ == '__main__':
    import csv
    dicchunk = {*****} #my dictionary
    main()
Upvotes: 1
Views: 3387
Reputation: 94881
I created a test program to run this once with multiprocessing, and once without:
import time
from multiprocessing import Pool
# f is the worker function from the question

def main(data):
    p = Pool(len(data))  # number of processes = number of CPUs
    keys, values = zip(*data.items())  # ordered keys and values

    start = time.time()
    processed_values = p.map(f, values)
    result = dict(zip(keys, processed_values))
    print("multi: {}".format(time.time() - start))
    p.close()  # no more tasks
    p.join()  # wrap up current tasks

    start = time.time()
    processed_values = map(f, values)
    result2 = dict(zip(keys, processed_values))
    print("non-multi: {}".format(time.time() - start))
    assert(result == result2)
Here's the output:
multi: 191.249588966
non-multi: 225.774535179
multiprocessing is faster, but not by as much as you might expect. The reason is that some of the sub-lists take much longer (several minutes) to finish than others. You'll never be faster than however long it takes to process the largest sub-list.
I added some tracing to the worker function to demonstrate this: I saved the time at the start of the worker and printed the elapsed time at the end. Here's the output:
<Process(PoolWorker-4, started daemon)> is done. Took 0.940237998962 seconds
<Process(PoolWorker-2, started daemon)> is done. Took 1.28068685532 seconds
<Process(PoolWorker-1, started daemon)> is done. Took 42.9250118732 seconds
<Process(PoolWorker-3, started daemon)> is done. Took 193.635578156 seconds
As you can see, the workers are doing very unequal amounts of work, so you're only saving about 44 seconds compared with running it sequentially.
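The answer doesn't show the tracing code itself, but a minimal sketch of how such per-worker timing could be added looks like this (the wrapper name timed_f is an assumption, not the answer's actual code; f is the worker function from the question):

import time
from multiprocessing import current_process

def timed_f(values):
    # wrap the original worker f and report how long this task took
    start = time.time()
    result = f(values)
    print("{} is done. Took {} seconds".format(current_process(), time.time() - start))
    return result

Mapping timed_f instead of f (i.e. p.map(timed_f, values)) produces per-task timings like the ones shown above.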
Upvotes: 1