Reputation: 905
I want to read a file that is around 2GB, and I am trying to use a multiprocessing pool to do it, but I am getting this error:
TypeError: 'type' object is not iterable
I know that map always accepts an argument which is iterable, but is there a way to do this? Here is my code so far:
import time
from multiprocessing import Pool

def load_embeddings(FileName):
    #file = open(FileName,'r')
    embeddings = {}
    i = 0
    print "Loading word embeddings first time"
    for line in FileName:
        # print line
        tokens = line.split('\t')
        tokens[-1] = tokens[-1].strip()
        #each line has 400 tokens
        for i in xrange(1, len(tokens)):
            tokens[i] = float(tokens[i])
        embeddings[tokens[0]] = tokens[1:-1]
    print "finished"
    return embeddings

if __name__ == "__main__":
    t1 = time.time()
    p = Pool(processes=5)
    FileName = './asag/Resources/EN-wform.w.5.cbow.neg10.400.subsmpl.txt'
    file_ = open(FileName,'r')
    #fun = partial(load_embeddings,FileName)
    result = p.map(load_embeddings, file_)
    p.close()
    p.join()
    print ("Time it took :" + str(time.time() - t1))
Upvotes: 0
Views: 4467
Reputation: 11593
Your source code would be correct if it ran in a single-process environment, although your argument FileName should be named file, as it's really an open file handle and not a filename (string).
Now, what happens is that you are giving 5 processes the same file handle to work on. With for line in FileName you are doing the read operations on that file handle, and this happens in parallel in 5 different processes, none of them knowing about the others (that's the beauty of it: to the OS these are separate programs, but they all read from the same file handle). These reads are not atomic, and a read can be interrupted after only part of a line has been consumed. It could also be that Python buffers internally, but the buffer is per process. The result is that line can contain half a line, or parts of the first line and parts of the second (because Python just reads until it sees the first \n), and then you get errors when you try to process the line further.
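Here is a minimal sketch (not part of the original answer) that makes the shared-offset behavior visible on a Unix system, where forked workers inherit the parent's file descriptor; os.read is used so that no per-process buffering hides the effect:

import os
from multiprocessing import Process

def reader(fd, name):
    # unbuffered reads advance the offset that is shared with the other process
    while True:
        chunk = os.read(fd, 6)
        if not chunk:
            break
        print name, repr(chunk)

if __name__ == "__main__":
    with open('demo.txt', 'w') as f:
        f.write('line1\nline2\nline3\nline4\n')
    fd = os.open('demo.txt', os.O_RDONLY)
    workers = [Process(target=reader, args=(fd, 'p%d' % i)) for i in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    os.close(fd)

Each process receives only some of the chunks and neither sees the whole file, which is exactly what happens to the pool workers iterating the shared handle.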
To fix this you'll need to read the file first in your main process and hand the lines to the map
function, like this:
from multiprocessing import Pool

def load_embeddings(line):
    embeddings = {}
    tokens = line.split('\t')
    tokens[-1] = tokens[-1].strip()
    #each line has 400 tokens
    for i in xrange(1, len(tokens)):
        tokens[i] = float(tokens[i])
    embeddings[tokens[0]] = tokens[1:-1]
    return embeddings

if __name__ == "__main__":
    p = Pool(processes=5)
    file_name = 'file.tsf'
    lines = []
    with open(file_name, 'r') as f:
        for line in f:
            lines.append(line.strip())
    result = p.map(load_embeddings, lines)
    p.close()
    p.join()
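Note that p.map here returns a list of one-entry dicts (one per line) that you still have to merge, and building lines first keeps the whole 2 GB file in memory. A variation (a sketch, not part of the original answer) streams the file to the workers with imap and merges the results as they arrive; the helper name parse_line is just an illustration, and it assumes every tab-separated token after the word belongs to the vector:

from multiprocessing import Pool

def parse_line(line):
    tokens = line.strip().split('\t')
    # first token is the word, the remaining ones are the vector components
    return tokens[0], [float(t) for t in tokens[1:]]

if __name__ == "__main__":
    embeddings = {}
    p = Pool(processes=5)
    with open('file.tsf', 'r') as f:
        # imap pulls lines from the file lazily in the main process;
        # chunksize batches lines per task to cut down on IPC overhead
        for word, vector in p.imap(parse_line, f, chunksize=1000):
            embeddings[word] = vector
    p.close()
    p.join()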
Upvotes: 1