Reputation: 167
I'm trying to use the multiprocessing
package to concurrently read a file and overwrite (parts of) it after some data transformation. I understand it seems a bit abstract, but I have a use for this kind of concurrency to speed up my own blocksync
fork.
Below you can find my code snippet:
#!/usr/bin/python2
import multiprocessing
import sys
import time

blocksize = 1024

def do_open(f, mode):
    f = open(f, mode)
    f.seek(0, 2)
    size = f.tell()
    f.seek(0)
    return f, size

def pipe_getblocks(f, pipe, side):
    print "Child file object ID: "+str(id(f))
    while True:
        print "getblocks_seek_prev: "+str(f.tell())
        block = f.read(blocksize)
        if not block:
            break
        print "getblocks_seek_next: "+str(f.tell())
        pipe.send(block)

def pipe_server(dev):
    f, size = do_open(dev, 'r+')
    parent, child = multiprocessing.Pipe(False)
    reader = multiprocessing.Process(target=pipe_getblocks, args=(f, child, "R"))
    reader.daemon = True
    reader.start()
    child.close()
    i = 0
    print "Parent file object ID:"+str(id(f))
    while True:
        try:
            block = parent.recv()
        except EOFError:
            break
        else:
            print str(i)+":pseek: "+str(f.tell()/1024/1024)
            f.seek(0, 0)  # this seek should not be seen by the child process...
            i = i + 1

pipe_server("/root/random.img")
Basically, the parent process should wait for the child to populate the pipe, then read from it. Please pay attention to the f.seek(0,0)
line: I put it there to verify that parent and child each have their own idea of where to seek in the file. In other words, being two entirely different processes, I expected that an f.seek
done in the parent would have no effect on its child.
However, this assumption seems to be wrong, as the above program produces the following output:
Child file object ID: 140374094691616
getblocks_seek_prev: 0
getblocks_seek_next: 1024
...
getblocks_seek_next: 15360
getblocks_seek_prev: 15360
getblocks_seek_next: 16384
getblocks_seek_prev: 16384
getblocks_seek_next: 17408 <-- past EOF!
getblocks_seek_prev: 17408 <-- past EOF!
getblocks_seek_next: 18432 <-- past EOF!
getblocks_seek_prev: 18432 <-- past EOF!
...
Parent file object ID:140374094691616
0:pseek: 0
1:pseek: 0
2:pseek: 0
3:pseek: 0
4:pseek: 0
5:pseek: 0
6:pseek: 0
7:pseek: 0
8:pseek: 0
9:pseek: 0
10:pseek: 0
...
As you can see, the child process reads past its EOF, or rather it thinks it does, because it is actually reading from the start of the file again. In short, the parent's f.seek(0,0)
appears to affect the child process without the child being aware of it.
My assumption was that the file object is stored in shared memory, so both processes are modifying the same data/object. This idea seems confirmed by the id(f)
values taken from the parent and child processes, which are identical. However, I found no reference stating that file objects are kept in shared memory when using the multiprocessing
package.
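(Note that id() in CPython is just the object's address inside each process, so identical values across two processes may not mean much by themselves. For example, this quick standalone check, with a plain list instead of a file object, prints the same id in parent and child even though the list is clearly not shared:)

import os

data = [1, 2, 3]
pid = os.fork()
if pid == 0:
    # child: mutate its (copy-on-write) private copy
    data.append(4)
    print "child:  id="+str(id(data))+" data="+str(data)
    os._exit(0)
else:
    os.waitpid(pid, 0)
    # parent: same id as the child printed, but the append is not visible here
    print "parent: id="+str(id(data))+" data="+str(data)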
So, my question is: is that the expected behavior, or am I missing something obvious?
Upvotes: 2
Views: 1388
Reputation: 12721
Python is starting the child process using fork()
, which causes the child to inherit the file descriptors from its parent. Since they're sharing a file descriptor, they also share the same seek offset. From the manpage of fork(2)
:
The child inherits copies of the parent's set of open file descriptors. Each file descriptor in the child refers to the same open file description (see open(2)) as the corresponding file descriptor in the parent. This means that the two file descriptors share open file status flags, file offset, and signal-driven I/O attributes (see the description of F_SETOWN and F_SETSIG in fcntl(2)).
Python file
objects on Unix are very thin wrappers over file descriptors (the implementation currently boils down to a file descriptor number and some metadata about the path; the seek()
method is just calling lseek(2)
), so cloning the object into the child process essentially just carries the file descriptor across.
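A minimal standalone sketch of that sharing, using the raw os.* calls to avoid any stdio buffering (the path is the one from the question; any regular file of a few KB works):

#!/usr/bin/python2
import os

fd = os.open("/root/random.img", os.O_RDONLY)
pid = os.fork()
if pid == 0:
    # child: move the shared file offset, then exit
    os.lseek(fd, 4096, os.SEEK_SET)
    os._exit(0)
else:
    # parent: after the child exits, the offset has moved for us too
    os.waitpid(pid, 0)
    print "parent offset: "+str(os.lseek(fd, 0, os.SEEK_CUR))  # prints 4096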
The easiest solution I can think of is to pass the path to the child process and open the file in each process separately. You might be tempted to try something with os.dup
, but that would not help here: a duplicated descriptor refers to the same open file description as the original, so it shares the same file offset.
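For instance, here is a minimal sketch of that fix applied to the question's code (only the changed parts are shown; the rest of pipe_server stays as it was):

def pipe_getblocks(path, pipe, side):
    # open inside the child, so it gets its own file object and offset
    f, size = do_open(path, 'r+')
    while True:
        block = f.read(blocksize)
        if not block:
            break
        pipe.send(block)

def pipe_server(dev):
    f, size = do_open(dev, 'r+')  # the parent keeps its own separate handle
    parent, child = multiprocessing.Pipe(False)
    reader = multiprocessing.Process(target=pipe_getblocks,
                                     args=(dev, child, "R"))  # pass the path, not f
    # ... the rest of the loop is unchanged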
Upvotes: 5
Reputation: 2740
I think you want a separate open for each child process. Instead of passing the file object as an argument, try passing the path to the file instead, then opening it inside the function that multiprocessing calls.
Upvotes: 1