Reputation: 683
I have many files in a folder, so I thought I should use multiprocessing to read the txt files. But when I compared the time with and without a Pool, I found that not using a Pool is faster. I don't know why, so in what situation should I use a Pool to read files (huge files?)
using Pool
time:0.5836s
not using Pool
time:0.0076s
The code is:
import pandas as pd
from multiprocessing import Pool
import glob2, os, time

class PandasReadFile:

    def __init__(self):
        print('123')

    def readFilePool(self, path):
        n, t = 0, time.time()
        print(t)
        pp = Pool(processes=1)
        # here is using pool
        df = pd.concat(pp.map(self.read_csv, glob2.iglob(os.path.join(path, "*.txt"))))
        # not using pool
        # df = pd.concat(map(pd.read_csv, glob2.iglob(os.path.join(path, "*.txt"))))
        t = time.time() - t
        print('%.4fs' % (t))
        print(df)

    @staticmethod
    def read_csv(filename):
        return pd.read_csv(filename)

if __name__ == '__main__':
    p = PandasReadFile()
    p.readFilePool('D:/')
Upvotes: 0
Views: 551
Reputation: 22992
You can spawn as many processes as you want, but since they all read from the same hard drive, you won't reduce the time. Worse: you will lose time, because each worker process has to be started and the data has to be shipped between processes.
Use multiprocessing for CPU-intensive tasks, not for IO-intensive tasks.
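For illustration, here is a minimal sketch of a case where a Pool does pay off: each file needs non-trivial CPU work after it is read. The parse_and_aggregate function and the squared-sum computation are made up for this example, not part of your code:
import glob
import os
import pandas as pd
from multiprocessing import Pool

def parse_and_aggregate(filename):
    # Reading the file is IO-bound, but the numeric work below is
    # CPU-bound, so several worker processes can run it in parallel.
    df = pd.read_csv(filename)
    return (df.select_dtypes('number') ** 2).sum().sum()

if __name__ == '__main__':
    files = glob.glob(os.path.join('D:/', '*.txt'))
    with Pool(processes=4) as pool:
        totals = pool.map(parse_and_aggregate, files)
    print(sum(totals))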
You may reduce the time with two processes if you copy files from one drive to another, since one process can read while the other writes. It may also work with mounted network drives (NAS); a sketch follows.
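A minimal sketch of that idea; the destination path and the copy_to_nas helper are placeholders for illustration:
import glob
import os
import shutil
from multiprocessing import Pool

def copy_to_nas(src):
    # Each worker copies one file; with two different physical drives,
    # one process can be reading while another is writing.
    dst = os.path.join('//nas/share', os.path.basename(src))
    shutil.copy(src, dst)
    return dst

if __name__ == '__main__':
    files = glob.glob(os.path.join('D:/', '*.txt'))
    with Pool(processes=2) as pool:
        pool.map(copy_to_nas, files)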
Upvotes: 1