Reputation: 167
I'm trying to use the multiprocessing
package to concurrently read a file and overwrite (parts of) it after some data transformation. I understand it seems a bit abstract, but I have a use for this kind of concurrency to speed up my own blocksync
fork.
Below you can find my code snippet:
#!/usr/bin/python2
import multiprocessing
import sys
import time

blocksize = 1024

def do_open(f, mode):
    f = open(f, mode)
    f.seek(0, 2)
    size = f.tell()
    f.seek(0)
    return f, size

def pipe_getblocks(f, pipe, side):
    print "Child file object ID: "+str(id(f))
    while True:
        print "getblocks_seek_prev: "+str(f.tell())
        block = f.read(blocksize)
        if not block:
            break
        print "getblocks_seek_next: "+str(f.tell())
        pipe.send(block)

def pipe_server(dev):
    f, size = do_open(dev, 'r+')
    parent, child = multiprocessing.Pipe(False)
    reader = multiprocessing.Process(target=pipe_getblocks, args=(f, child, "R"))
    reader.daemon = True
    reader.start()
    child.close()
    i = 0
    print "Parent file object ID:"+str(id(f))
    while True:
        try:
            block = parent.recv()
        except EOFError:
            break
        else:
            print str(i)+":pseek: "+str(f.tell()/1024/1024)
            f.seek(0, 0)  # this seek should not be seen by the child process...
            i = i + 1

pipe_server("/root/random.img")
Basically, the parent process should wait for the child to populate the pipe, then read from it. Please pay attention to the f.seek(0,0)
line: I put it there to verify that parent and child each have their own idea of where to seek in the file. In other words, being two entirely different processes, I expected that an f.seek
done in the parent would have no effect on its child.
However, this assumption seems to be wrong, as the above program produces the following output:
Child file object ID: 140374094691616
getblocks_seek_prev: 0
getblocks_seek_next: 1024
...
getblocks_seek_next: 15360
getblocks_seek_prev: 15360
getblocks_seek_next: 16384
getblocks_seek_prev: 16384
getblocks_seek_next: 17408 <-- past EOF!
getblocks_seek_prev: 17408 <-- past EOF!
getblocks_seek_next: 18432 <-- past EOF!
getblocks_seek_prev: 18432 <-- past EOF!
...
Parent file object ID:140374094691616
0:pseek: 0
1:pseek: 0
2:pseek: 0
3:pseek: 0
4:pseek: 0
5:pseek: 0
6:pseek: 0
7:pseek: 0
8:pseek: 0
9:pseek: 0
10:pseek: 0
...
As you can see, the child process reads past its EOF, or rather it thinks it does, because it is actually reading from the start of the file again. In short, the parent's f.seek(0,0)
appears to affect the child process without the child being aware of it.
My assumption was that the file object is stored in shared memory, so both processes are modifying the same data/object. This idea seems confirmed by the id(f)
values taken from the parent and child processes, which are identical. However, I found no reference stating that file objects are kept in shared memory when using the multiprocessing
package.
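(Note that id() in CPython is just the object's address inside each process, so identical values across two processes may not mean much by themselves. For example, this quick standalone check, with a plain list instead of a file object, prints the same id in parent and child even though the list is clearly not shared:)

import os

data = [1, 2, 3]
pid = os.fork()
if pid == 0:
    # child: mutate its (copy-on-write) private copy
    data.append(4)
    print "child:  id="+str(id(data))+" data="+str(data)
    os._exit(0)
else:
    os.waitpid(pid, 0)
    # parent: same id as the child printed, but the append is not visible here
    print "parent: id="+str(id(data))+" data="+str(data)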
So, my question is: is that the expected behavior, or am I missing something obvious?
Upvotes: 2
Views: 1388
Reputation: 12721
Python is starting the child process using fork()
, which causes the child to inherit the file descriptors from its parent. Since they're sharing a file descriptor, they also share the same seek offset. From the manpage of fork(2)
:
The child inherits copies of the parent's set of open file descriptors. Each file descriptor in the child refers to the same open file description (see open(2)) as the corresponding file descriptor in the parent. This means that the two file descriptors share open file status flags, file offset, and signal-driven I/O attributes (see the description of F_SETOWN and F_SETSIG in fcntl(2)).
Python file
objects on Unix are very thin wrappers over file descriptors (the implementation currently boils down to a file descriptor number and some metadata about the path; the seek()
method is just calling lseek(2)
), so cloning the object into the child process essentially just carries the file descriptor across.
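A minimal standalone sketch of that sharing, using the raw os.* calls to avoid any stdio buffering (the path is the one from the question; any regular file of a few KB works):

#!/usr/bin/python2
import os

fd = os.open("/root/random.img", os.O_RDONLY)
pid = os.fork()
if pid == 0:
    # child: move the shared file offset, then exit
    os.lseek(fd, 4096, os.SEEK_SET)
    os._exit(0)
else:
    # parent: after the child exits, the offset has moved for us too
    os.waitpid(pid, 0)
    print "parent offset: "+str(os.lseek(fd, 0, os.SEEK_CUR))  # prints 4096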
The easiest solution I can think of is to pass the path to the child process and open the file in each process separately. You might be tempted to try something with os.dup
, but that would not help here: a duplicated descriptor refers to the same open file description as the original, so it shares the same file offset.
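For instance, here is a minimal sketch of that fix applied to the question's code (only the changed parts are shown; the rest of pipe_server stays as it was):

def pipe_getblocks(path, pipe, side):
    # open inside the child, so it gets its own file object and offset
    f, size = do_open(path, 'r+')
    while True:
        block = f.read(blocksize)
        if not block:
            break
        pipe.send(block)

def pipe_server(dev):
    f, size = do_open(dev, 'r+')  # the parent keeps its own separate handle
    parent, child = multiprocessing.Pipe(False)
    reader = multiprocessing.Process(target=pipe_getblocks,
                                     args=(dev, child, "R"))  # pass the path, not f
    # ... the rest of the loop is unchanged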
Upvotes: 5
Reputation: 2740
I think you want a separate open for each child process. Instead of passing the file object as an argument, try passing the path to the file instead, then opening it inside the function that multiprocessing calls.
Upvotes: 1