Reputation: 683
I have many files in a folder, so I thought I should use multiprocessing to read the txt files. But when I compared the time with and without a Pool, I found that not using a Pool is faster. I don't know why, so in what situation should I use a Pool to read files (huge files?)
using Pool
time:0.5836s
not using Pool
time:0.0076s
The code is:
import pandas as pd
from multiprocessing import Pool
import glob2, os, time

class PandasReadFile:

    def __init__(self):
        print('123')

    def readFilePool(self, path):
        n, t = 0, time.time()
        print(t)
        pp = Pool(processes=1)
        # here is using pool
        df = pd.concat(pp.map(self.read_csv, glob2.iglob(os.path.join(path, "*.txt"))))
        # not using pool
        # df = pd.concat(map(pd.read_csv, glob2.iglob(os.path.join(path, "*.txt"))))
        t = time.time() - t
        print('%.4fs' % (t))
        print(df)

    @staticmethod
    def read_csv(filename):
        return pd.read_csv(filename)

if __name__ == '__main__':
    p = PandasReadFile()
    p.readFilePool('D:/')
Upvotes: 0
Views: 551
Reputation: 22992
You can spawn as many processes as you want, but since they all read from the same hard drive, you won't reduce the time. Worse: you will lose time, because each worker process has to be started and the data has to be shipped between processes.
Use multiprocessing for CPU-intensive tasks, not for IO-intensive tasks.
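For illustration, here is a minimal sketch of a case where a Pool does pay off: each file needs non-trivial CPU work after it is read. The parse_and_aggregate function and the squared-sum computation are made up for this example, not part of your code:
import glob
import os
import pandas as pd
from multiprocessing import Pool

def parse_and_aggregate(filename):
    # Reading the file is IO-bound, but the numeric work below is
    # CPU-bound, so several worker processes can run it in parallel.
    df = pd.read_csv(filename)
    return (df.select_dtypes('number') ** 2).sum().sum()

if __name__ == '__main__':
    files = glob.glob(os.path.join('D:/', '*.txt'))
    with Pool(processes=4) as pool:
        totals = pool.map(parse_and_aggregate, files)
    print(sum(totals))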
You may reduce the time with two processes if you copy files from one drive to another, since one process can read while the other writes. It may also work with mounted network drives (NAS); a sketch follows.
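A minimal sketch of that idea; the destination path and the copy_to_nas helper are placeholders for illustration:
import glob
import os
import shutil
from multiprocessing import Pool

def copy_to_nas(src):
    # Each worker copies one file; with two different physical drives,
    # one process can be reading while another is writing.
    dst = os.path.join('//nas/share', os.path.basename(src))
    shutil.copy(src, dst)
    return dst

if __name__ == '__main__':
    files = glob.glob(os.path.join('D:/', '*.txt'))
    with Pool(processes=2) as pool:
        pool.map(copy_to_nas, files)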
Upvotes: 1