Khris

Reputation: 3212

Python Multiprocessing using Pool goes recursively haywire

I'm trying to parallelise an expensive part of my pandas calculations to speed things up.

I've already managed to make Multiprocessing.Pool work with a simple example:

import multiprocessing as mpr
import numpy as np

def Test(l):
  for i in range(len(l)):
    l[i] = i**2
  return l

t = list(np.arange(100))
L = [t,t,t,t]
if __name__ == "__main__":
  pool = mpr.Pool(processes=4)
  E = pool.map(Test,L)
  pool.close()
  pool.join()

No problems here. Now my own algorithm is a bit more complicated, I can't post it here in its full glory and terribleness, so I'll use some pseudo-code to outline the things I'm doing there:

import pandas as pd
import time
import datetime as dt
import multiprocessing as mpr
import MPFunctions as mpf        # self-written worker functions that get called for the multiprocessing
import ClassGetDataFrames as gd  # self-written class that reads in all the data and puts it into dataframes

=== Settings

=== Use ClassGetDataFrames to get data

=== Lots of single-thread calculations and manipulations on the dataframe

=== Cut dataframe into 4 evenly big chunks, make list of them called DDC

if __name__ == "__main__":
  pool = mpr.Pool(processes=4)
  LLT = pool.map(mpf.processChunks,DDC)
  pool.close()
  pool.join()

=== Join processed Chunks LLT back into one dataframe

=== More calculations and manipulations

=== Data Output

When I'm running this script the following happens:

  1. It reads in the data.

  2. It does all calculations and manipulations until the Pool statement.

  3. Suddenly it reads in the data again, fourfold.

  4. Then it goes into the main script fourfold at the same time.

  5. The whole thing cascades recursively and goes haywire.

I have read before that this can happen if you're not careful, but I don't understand why it happens here. My multiprocessing code is protected by the required name-main guard (I'm on Win7 64-bit), it is only four lines long, it has close and join statements, and it calls one defined worker function which in turn calls a second worker function in a loop, that's it. As far as I know it should just create the pool with four processes, run the imported worker on the four chunks, close the pool, wait until everything is done, and then continue with the script. On a side note, I first had the worker functions in the same script; the behaviour was the same. Instead of just running what's in the pool it seems to restart the whole script fourfold.
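For context, a minimal sketch of the mechanism at work (the function names here are made up for the example, not from the actual script): on Windows, multiprocessing uses the spawn start method, so every worker process re-imports the main script, and anything sitting at module level runs again in each child. Only the code inside the name-main guard is skipped by the workers:

```python
import multiprocessing as mpr

# This line runs once in the parent AND once in every spawned worker,
# because spawn re-imports the main module in each child process.
print("module-level code executed")

def square_chunk(chunk):
    # Worker must be a picklable top-level function.
    return [x ** 2 for x in chunk]

if __name__ == "__main__":
    # Only the parent process enters this block; workers skip it.
    chunks = [[0, 1, 2], [3, 4, 5]]
    pool = mpr.Pool(processes=2)
    result = pool.map(square_chunk, chunks)
    pool.close()
    pool.join()
    print(result)
```

So any expensive data loading that lives at module level gets repeated by every worker, which matches the "reads in the data again, fourfold" symptom.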

Can anyone enlighten me what might cause this behaviour? I seem to be missing some crucial understanding about Python's multiprocessing behaviour.

Also I don't know if it's important, I'm on a virtual machine that sits on my company's mainframe.

Do I have to use individual processes instead of a pool?

Upvotes: 2

Views: 244

Answers (1)

Khris

Reputation: 3212

I managed to make it work by enclosing the entire script in the if __name__ == "__main__":-statement, not just the multiprocessing part.
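A common way to structure this (a sketch with stand-in names, not the asker's actual code) is to move all the module-level work into a main() function, so that when a spawned worker re-imports the script it only defines functions and never reruns the pipeline:

```python
import multiprocessing as mpr

def process_chunk(chunk):
    # Stand-in for mpf.processChunks: square every element of the chunk.
    return [x ** 2 for x in chunk]

def main():
    # Everything that used to run at module level goes here: reading
    # the data, single-thread calculations, cutting into chunks, ...
    ddc = [[0, 1], [2, 3], [4, 5], [6, 7]]  # stand-in for the 4 chunks

    pool = mpr.Pool(processes=4)
    llt = pool.map(process_chunk, ddc)
    pool.close()
    pool.join()

    # Join the processed chunks back together, more calculations, output.
    return [x for chunk in llt for x in chunk]

if __name__ == "__main__":
    print(main())  # only the parent process ever reaches this line
```

With this layout the guard does not need to wrap the whole file literally; it only needs to wrap everything that should run exactly once in the parent.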

Upvotes: 2
