Reputation: 905
I want to read a file that is around 2GB, and I am trying to use a multiprocessing pool to do it, but I am getting this error:
TypeError: 'type' object is not iterable
I know that map always accepts an argument which is iterable, but is there a way to do this? Here is my code so far:
import time
from multiprocessing import Pool

def load_embeddings(FileName):
    #file = open(FileName,'r')
    embeddings = {}
    i = 0
    print "Loading word embeddings first time"
    for line in FileName:
        # print line
        tokens = line.split('\t')
        tokens[-1] = tokens[-1].strip()
        #each line has 400 tokens
        for i in xrange(1, len(tokens)):
            tokens[i] = float(tokens[i])
        embeddings[tokens[0]] = tokens[1:-1]
    print "finished"
    return embeddings

if __name__ == "__main__":
    t1 = time.time()
    p = Pool(processes=5)
    FileName = './asag/Resources/EN-wform.w.5.cbow.neg10.400.subsmpl.txt'
    file_ = open(FileName,'r')
    #fun = partial(load_embeddings,FileName)
    result = p.map(load_embeddings, file_)
    p.close()
    p.join()
    print ("Time it took :" + str(time.time() - t1))
Upvotes: 0
Views: 4467
Reputation: 11593
Your source code would be correct if it ran in a single-process environment, although your argument FileName should be named file, as it's really an open file handle and not a filename (string).
Now, what happens is that you are giving 5 processes the same file handle to work on. With for line in FileName you are doing the read operations on that file handle, and this happens in parallel in 5 different processes, none of them knowing about the others (that's the beauty of it: to the OS these are separate programs, but they all read from the same file handle). These reads are not atomic, and a read can be interrupted after only part of a line has been consumed. It could also be that Python buffers internally, but the buffer is per process. The result is that line can contain half a line, or parts of the first line and parts of the second (because Python just reads until it sees the first \n), and then you get errors when you try to process the line further.
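Here is a minimal sketch (not part of the original answer) that makes the shared-offset behavior visible on a Unix system, where forked workers inherit the parent's file descriptor; os.read is used so that no per-process buffering hides the effect:

import os
from multiprocessing import Process

def reader(fd, name):
    # unbuffered reads advance the offset that is shared with the other process
    while True:
        chunk = os.read(fd, 6)
        if not chunk:
            break
        print name, repr(chunk)

if __name__ == "__main__":
    with open('demo.txt', 'w') as f:
        f.write('line1\nline2\nline3\nline4\n')
    fd = os.open('demo.txt', os.O_RDONLY)
    workers = [Process(target=reader, args=(fd, 'p%d' % i)) for i in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    os.close(fd)

Each process receives only some of the chunks and neither sees the whole file, which is exactly what happens to the pool workers iterating the shared handle.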
To fix this you'll need to read the file first in your main process and hand the lines to the map
function, like this:
from multiprocessing import Pool

def load_embeddings(line):
    embeddings = {}
    tokens = line.split('\t')
    tokens[-1] = tokens[-1].strip()
    #each line has 400 tokens
    for i in xrange(1, len(tokens)):
        tokens[i] = float(tokens[i])
    embeddings[tokens[0]] = tokens[1:-1]
    return embeddings

if __name__ == "__main__":
    p = Pool(processes=5)
    file_name = 'file.tsf'
    lines = []
    with open(file_name, 'r') as f:
        for line in f:
            lines.append(line.strip())
    result = p.map(load_embeddings, lines)
    p.close()
    p.join()
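Note that p.map here returns a list of one-entry dicts (one per line) that you still have to merge, and building lines first keeps the whole 2 GB file in memory. A variation (a sketch, not part of the original answer) streams the file to the workers with imap and merges the results as they arrive; the helper name parse_line is just an illustration, and it assumes every tab-separated token after the word belongs to the vector:

from multiprocessing import Pool

def parse_line(line):
    tokens = line.strip().split('\t')
    # first token is the word, the remaining ones are the vector components
    return tokens[0], [float(t) for t in tokens[1:]]

if __name__ == "__main__":
    embeddings = {}
    p = Pool(processes=5)
    with open('file.tsf', 'r') as f:
        # imap pulls lines from the file lazily in the main process;
        # chunksize batches lines per task to cut down on IPC overhead
        for word, vector in p.imap(parse_line, f, chunksize=1000):
            embeddings[word] = vector
    p.close()
    p.join()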
Upvotes: 1