Reputation: 31
When queue.put() is faster than queue.get(), I find that the P1 process uses a large amount of memory (because P1 keeps putting lines from a large text file into the queue). Even after P2 has finished getting the lines from the queue, the memory used by P1 is still not released. How can I fix this issue? Sample and test code below.
Thanks!
import time
from multiprocessing import Process, Queue

def addline(q):
    f = file('a big text file', 'r')
    line = True
    while line:
        line = f.readline()
        q.put(line, False)
    f.close()
    print "P1:finished"
    # keep P1 alive so its memory usage can be observed
    while 1:
        time.sleep(2)

def getline(q):
    f = file('/tmp/bak', 'w')
    line = True
    while line:
        line = q.get()
        f.write(line)
        time.sleep(0.01)  # make the consumer slower than the producer
    f.close()
    print "p2:finished"

if __name__ == "__main__":
    q = Queue()
    p1 = Process(name="addline", target=addline, args=(q,))
    p2 = Process(name="getline", target=getline, args=(q,))
    p1.start()
    p2.start()
Edit: I tried reading a 44 MB text file and observed /proc/pid/smaps. I found that the memory that hasn't been released shows up as Private_Dirty in the heap.
00fb3000-04643000 rw-p 00000000 00:00 0 [heap]
Size: 55872 kB
Rss: 55844 kB
Pss: 55163 kB
Shared_Clean: 0 kB
Shared_Dirty: 1024 kB
Private_Clean: 0 kB
Private_Dirty: 54820 kB
Referenced: 54972 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Upvotes: 3
Views: 4491
Reputation: 35826
Python's garbage collector deletes an object as soon as it is no longer referenced. As long as the write rate can keep up with the read rate of your storage hardware, reading a file's content while writing it at the same time from two independent threads/processes should be possible without growing memory and with a small memory footprint. I believe your problem will disappear if you use the Python language constructs that are more appropriate for your use case. I'll try to comment on that.
For reading a file line by line you should use the following concept:
with open('filepath') as f:
    for line in f:
        do_something_with(line)
You do not have to explicitly .close() the file then. The same applies to writing a file. Read about the with statement here: http://effbot.org/zone/python-with-statement.htm
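Applied to both sides of your copy loop, it could look like the following sketch (the file names are placeholders; the single with statement with two context managers needs Python 2.7 or newer, otherwise nest two with blocks):

# Read the source and write the destination inside one with statement;
# both files are closed automatically, even if an exception is raised.
with open('infile') as src, open('outfile', 'w') as dst:
    for line in src:
        dst.write(line)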
From my point of view, for the use case you've presented, a multiprocessing.Pipe instead of a multiprocessing.Queue would be more appropriate because of the "stream-like" nature of the application. It seems odd to represent raw file content as items in a queue. Furthermore, you could get rid of a lot of communication overhead if you used threads instead of independent subprocesses (then you should use an os.pipe for inter-thread communication)**. In any case, you should join() threads and subprocesses after starting them.
** For your use case (copying a file), the global interpreter lock (GIL) will not be a performance problem.
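To illustrate, here is a minimal sketch of how the Pipe-based variant might look. The paths are taken from your example; the sender/receiver function names and the None sentinel are just one way to structure it. Note that conn.send() blocks when the underlying pipe buffer is full, so the producer cannot run far ahead of the consumer:

from multiprocessing import Process, Pipe

def sender(conn, path):
    # Stream the file line by line into the pipe; the with statement
    # closes the file automatically.
    with open(path) as f:
        for line in f:
            conn.send(line)   # blocks when the pipe buffer is full
    conn.send(None)           # sentinel: no more data
    conn.close()

def receiver(conn, path):
    with open(path, 'w') as f:
        while True:
            line = conn.recv()
            if line is None:  # sentinel seen, we are done
                break
            f.write(line)
    conn.close()

if __name__ == "__main__":
    # A unidirectional pipe: recv_end can only receive, send_end can only send.
    recv_end, send_end = Pipe(duplex=False)
    p1 = Process(target=sender, args=(send_end, 'a big text file'))
    p2 = Process(target=receiver, args=(recv_end, '/tmp/bak'))
    p1.start()
    p2.start()
    # join both workers, as suggested above
    p1.join()
    p2.join()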
Upvotes: 1