Hertha BSC fan
Hertha BSC fan

Reputation: 77

Python - multiprocessing and text file processing

BACKGROUND: I have a huge file .txt which I have to process. It is a data mining project. So I've split it to many .txt files each one 100MB size, saved them all in the same directory and managed to run them this way:

from multiprocessing.dummy import Pool

for filename in os.listdir(pathToFile):
    if filename.endswith(".txt"):
       process(filename)
    else:
       continue

In Process, I parse the file into a list of Objects, then I apply another function. This is SLOWER than running the whole file AS IS. But for big enough files I won't be able to run at once and I will have to slice. So I want to have threads as I don't have to wait for each process(filename) to finish.

How can I apply it ? I've checked this but I didn't understand how to apply it to my code...

Any help please would be appreciated. I looked here to see how to do this. What I've tried:

pool = Pool(6)
for x in range(6):
    futures.append(pool.apply_async(process, filename))

Unfortunately I realized it will only do the first 6 text files, or will it not ? How can I make it work ? as soon as a thread is over, assign to it another file text and start running.

EDIT:

for filename in os.listdir(pathToFile):
    if filename.endswith(".txt"):
       for x in range(6):
           pool.apply_async(process(filename))
    else:
       continue

Upvotes: 1

Views: 6563

Answers (1)

mata
mata

Reputation: 69052

First, using multiprocessing.dummy will only give you a speed increase if your problem is IO bound (when reading the files is the main bottleneck), for CPU intensive tasks (processing the file is the bottleneck) it won't help, in which case you should use "real" multiprocessing.

The problem you describe seems more fit for the use of one of the map functions of Pool:

from multiprocessing import Pool
files = [f for f in os.listdir(pathToFile) if f.endswith(".txt")]
pool = Pool(6)
results = pool.map(process, files)
pool.close()

This will use 6 worker processes to process the list of files and return a list of the return values of the process() function after all files have been processed. Your current example would submit the same file 6 times.

Upvotes: 4

Related Questions