sudhakar reddy

Reputation: 31

Multiprocessing in Python: multiple processes running the same instructions

I'm using multiprocessing in Python for parallelization. I'm trying to parallelize work on chunks of data read from a CSV file with pandas.

I'm new to multiprocessing and parallel processing. While trying it out on some simple code, I ran into unexpected behavior:

import time;
import os;
from multiprocessing import Process
import pandas as pd
print os.getpid();
df = pd.read_csv('train.csv', sep=',',usecols=["POLYLINE"],iterator=True,chunksize=2);
print "hello";
def my_function(chunk):
    print chunk;
count = 0;
processes = [];
for chunk in df:
    if __name__ == '__main__':
        p = Process(target=my_function,args=(chunk,));
        processes.append(p);
    if(count==4):
        break;
    count = count + 1;

The `print "hello"` statement is executed multiple times. I assumed each spawned process would only run the target function, not the code in the main module.

Can anyone suggest where I'm going wrong?


Upvotes: 1

Views: 2194

Answers (2)

Roland Smith

Reputation: 43495

Using multiprocessing is probably not going to speed up reading data from disk, since disk access is much slower than e.g. RAM access or calculations. Moreover, the different pieces of the file will end up in different processes anyway.

Using mmap could help speed up data access.

If you do a read-only mmap of the data file before starting e.g. a Pool.map, each worker could read its own slice of data from the shared memory mapped file and process it.

Upvotes: 0

Anonymous

Reputation: 12080

The way multiprocessing works (with the `spawn` start method, which is the default on Windows) is to create a new process and then import the module that contains the target function. Since your outermost scope has print statements, they get executed once for every process.
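
To make that re-import behaviour concrete, here is a small self-contained sketch (my illustration, not the asker's code; the scratch file name is made up). The module-level write runs once in the parent and again in the spawned child:

```python
import multiprocessing as mp
import os

MARKER = 'import_count.txt'  # hypothetical scratch file

# Module-level code: runs in the parent, and runs AGAIN in every child
# started with the 'spawn' method, because spawn re-imports this module.
with open(MARKER, 'a') as f:
    f.write('imported\n')


def work(x):
    return x * x


if __name__ == '__main__':
    ctx = mp.get_context('spawn')  # force spawn so the effect shows on any OS
    p = ctx.Process(target=work, args=(3,))
    p.start()
    p.join()
    with open(MARKER) as f:
        lines = len(f.readlines())
    os.remove(MARKER)
    print(lines)  # 2: one write from the parent, one from the spawned child
```

Anything guarded by `if __name__ == '__main__':` is skipped on re-import, which is why that guard must wrap the process-spawning code.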

By the way, you should use a `Pool` instead of managing `Process` objects directly. Here's a cleaned-up example:

from multiprocessing import Pool

import pandas as pd
NUM_PROCESSES = 4


def process_chunk(chunk):
    # do something
    return chunk


if __name__ == '__main__':
    df = pd.read_csv('train.csv', sep=',', usecols=["POLYLINE"], iterator=True, chunksize=2)
    pool = Pool(NUM_PROCESSES)

    for result in pool.map(process_chunk, df):
        print result

Upvotes: 3
