Relax ZeroC

Reputation: 683

About using multiprocessing to read files

I have many files in a folder, so I thought I should use multiprocessing. I used a multiprocessing pool to read the txt files in the folder, but when I compared the time with and without the pool, I found it is faster without the pool. I don't know why, so in what situation should I use a Pool to read files (huge files?).

using Pool
time:0.5836s
not using Pool
time:0.0076s

The code is:

import pandas as pd
from multiprocessing import Pool
import glob2, os, time

class PandasReadFile:

    def __init__(self):
        print('123')  # debug print

    def readFilePool(self, path):
        t = time.time()
        # using Pool: read each *.txt file in a worker process,
        # then concatenate the resulting DataFrames;
        # the context manager closes the pool when done
        with Pool(processes=1) as pp:
            df = pd.concat(pp.map(self.read_csv, glob2.iglob(os.path.join(path, "*.txt"))))
        # not using Pool:
        # df = pd.concat(map(pd.read_csv, glob2.iglob(os.path.join(path, "*.txt"))))
        t = time.time() - t
        print('%.4fs' % t)
        print(df)

    @staticmethod
    def read_csv(filename):
        # wrapper so the worker processes have a picklable callable
        return pd.read_csv(filename)

if __name__ == '__main__':
    p = PandasReadFile()
    p.readFilePool('D:/')

Upvotes: 0

Views: 551

Answers (1)

Laurent LAPORTE

Reputation: 22992

You can spawn as many processes as you want, but since they all read from the same hard drive, you won't reduce the time. Worse: you will lose time, because starting the processes and passing data between them has its own overhead.

Use multiprocessing for CPU-intensive tasks, not for IO-intensive tasks like reading files from disk.
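
For contrast, here is a minimal sketch (not taken from the question; the function and job sizes are made up for illustration) of a CPU-bound workload where a Pool does pay off:

import time
from multiprocessing import Pool

def cpu_heavy(n):
    # pure computation, no disk access
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    jobs = [5_000_000] * 8

    t = time.time()
    list(map(cpu_heavy, jobs))          # serial
    print('serial: %.2fs' % (time.time() - t))

    t = time.time()
    with Pool(processes=4) as pool:     # spread across 4 processes
        pool.map(cpu_heavy, jobs)
    print('pool:   %.2fs' % (time.time() - t))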

You may reduce time with two processes if you copy files from one drive to another. It may also work with mounted network drives (NAS).
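
If the goal is to overlap waits on a slow network drive rather than to use more CPU, one alternative (not part of this answer, sketched here under that assumption) is a thread pool, which avoids the process start-up and pickling costs:

import glob, os
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def read_all(path):
    # threads share memory: no pickling, no process start-up;
    # they can overlap IO waits when the files live on a NAS
    files = glob.glob(os.path.join(path, '*.txt'))
    with ThreadPoolExecutor(max_workers=4) as ex:
        return pd.concat(ex.map(pd.read_csv, files))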

Upvotes: 1
