Reputation: 1731
I am having some difficulty passing a database connection object or cursor object via pool.map in the Python multiprocessing package. Basically, I want to create a pool of workers, each with its own state and its own db connection, so that they can execute queries in parallel.
I have tried these approaches, but I get a PicklingError in Python with them -
Use Initializer to set up multiprocess pool
The second link is exactly what I need to do, meaning I'd like each process to open a database connection when it starts, then use that connection to process the data/args that are passed in.
Here is my code:

```python
import multiprocessing as mp
from itertools import repeat

def process_data((id, db)):
    print 'in process_data'
    cursor = db.cursor()
    query = ....
    #cursor.execute(query)
    #....
    .....
    .....
    return row

if __name__ == '__main__':
    db = getConnection()
    cursor = db.cursor()
    print 'Initialised db connection and cursor'
    inputs = [1, 2, 3, 4, 5]
    pool = mp.Pool(processes=2)
    result_list = pool.map(process_data, zip(inputs, repeat(db)))
    #print result_list
    pool.close()
    pool.join()
```
This results in the following error:

```
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.6/threading.py", line 484, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.6/multiprocessing/pool.py", line 225, in _handle_tasks
    put(task)
PicklingError: Can't pickle <type 'module'>: attribute lookup __builtin__.module failed
```
I guess the db or cursor object is not picklable, because if I replace repeat(db) with repeat(x), where x is an int or a string, it works. I have also tried the initializer-function approach, and it seems to work at first, but strange things happen when I execute queries: many ids return nothing even though the data is present.
What would be the best way to achieve this? I am using Python 2.6.6 on a Linux machine.
Upvotes: 5
Views: 6710
Reputation: 682
While I agree that we should avoid passing database connections from parent to child processes, it's a good idea to ensure that your child processes are not needlessly re-establishing connections on every function invocation.
There are a number of ways to do this, but the one I've found most straightforward is to move the initialization of the stateful object (e.g., a DB connection, a boto3 client, etc.) into a function that you memoize. This avoids the problem mentioned above while still letting you take advantage of features such as connection pools.
Here's an example that demonstrates that each worker reuses the same client across function invocations:

```python
import boto3
from functools import cache
from multiprocessing import Pool, current_process

@cache
def get_client():
    return boto3.client("glue")

def do_work(_):
    client = get_client()
    print(f"process {current_process()} has glue client {id(client)}")

def main():
    with Pool(processes=4) as pool:
        pool.map(do_work, range(12))

if __name__ == "__main__":
    main()
```
I hope this is helpful for others!
Upvotes: 0
Reputation: 25
Try pickling the db connection object. Pickling is independent of processes, so it might work.
See these pages - python pickle
pickle examples
Upvotes: -3
Reputation: 47988
I'm going to go ahead and put my comment up as an answer, because I think it's appropriate as one. You don't want to pass database connections from your parent process to your child processes. You want to pass static data or other objects that can be serialized to your child processes - rows of data, ids, and so on. Alternatively, have each child establish its own database connection when it becomes necessary.
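A minimal sketch of the second option, using Pool's initializer hook so each worker opens its own connection once. This uses sqlite3 purely as a stand-in for the question's getConnection(), and is written for a current Python 3 for brevity:

```python
import sqlite3
import multiprocessing as mp

conn = None  # each worker process gets its own connection

def init_worker():
    global conn
    conn = sqlite3.connect(":memory:")  # stand-in for getConnection()

def process_data(id):
    # Only the picklable `id` crosses the process boundary; the
    # connection already lives inside this worker process.
    cursor = conn.cursor()
    cursor.execute("SELECT ? * 2", (id,))
    return cursor.fetchone()[0]

if __name__ == "__main__":
    ctx = mp.get_context("fork")  # the asker is on Linux, where fork is available
    pool = ctx.Pool(processes=2, initializer=init_worker)
    print(pool.map(process_data, [1, 2, 3, 4, 5]))  # [2, 4, 6, 8, 10]
    pool.close()
    pool.join()
```

Only the ids travel through pool.map, so nothing unpicklable is ever serialized.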
Upvotes: 10