Reputation: 6144
I would like to create about 50,000 files using Python. They are very simple files, each with fewer than 20 lines.
I first tried adding threading just for the sake of it, and it took 220 seconds on my 8th-gen i7 machine.
WITH THREADS
import threading
import time

# path is assumed to be defined earlier in the script

def random_files(i):
    filepath = path + "/content/%s.html" % (str(i))
    fileobj = open(filepath, "w+")
    l1 = "---\n"
    l2 = 'title: "test"\n'
    l3 = "date: 2019-05-01T18:37:07+05:30" + "\n"
    l4 = "draft: false" + "\n"
    l5 = 'type: "statecity"' + "\n"
    l6 = "---" + "\n"
    data = l1 + l2 + l3 + l4 + l5 + l6
    fileobj.writelines(data)
    fileobj.close()

if __name__ == "__main__":
    start_time = time.time()
    for i in range(0, 50000):
        # one new thread per file: 50,000 threads are started in total
        threading.Thread(name='random_files', target=random_files, args=(i,)).start()
    print("--- %s seconds ---" % (time.time() - start_time))
WITHOUT THREADS
Going the non-threaded route takes 55 seconds.
import time

# path is assumed to be defined earlier in the script

def random_files():
    for i in range(0, 50000):
        filepath = path + "/content/%s.html" % (str(i))
        fileobj = open(filepath, "w+")
        l1 = "---\n"
        l2 = 'title: "test"\n'
        l3 = "date: 2019-05-01T18:37:07+05:30" + "\n"
        l4 = "draft: false" + "\n"
        l5 = 'type: "statecity"' + "\n"
        l6 = "---" + "\n"
        data = l1 + l2 + l3 + l4 + l5 + l6
        fileobj.writelines(data)
        fileobj.close()

if __name__ == "__main__":
    start_time = time.time()
    random_files()
    print("--- %s seconds ---" % (time.time() - start_time))
CPU usage is 10% for the Python process, RAM usage is a meager 50 MB, and disk usage averages 4.5 MB/s.
Can the speed be increased drastically?
Upvotes: 1
Views: 1038
Reputation: 1162
Try threading with the load split equally across the threads on your system.
This gives an almost linear speed-up with the number of threads the load is split across (file I/O releases the GIL while it waits on the disk, so the threads can overlap their writes):
Without Threading:
~11% CPU, ~5 MB/s disk
--- 69.15089249610901 seconds ---
With Threading: 4 threads
~22% CPU, ~13 MB/s disk
--- 29.21335482597351 seconds ---
With Threading: 8 threads
~27% CPU, ~15 MB/s disk
--- 20.8521249294281 seconds ---
For example:
import time
from threading import Thread

# path is assumed to be defined earlier in the script

def random_files(i):
    filepath = path + "/content/%s.html" % (str(i))
    fileobj = open(filepath, "w+")
    l1 = "---\n"
    l2 = 'title: "test"\n'
    l3 = "date: 2019-05-01T18:37:07+05:30" + "\n"
    l4 = "draft: false" + "\n"
    l5 = 'type: "statecity"' + "\n"
    l6 = "---" + "\n"
    data = l1 + l2 + l3 + l4 + l5 + l6
    fileobj.writelines(data)
    fileobj.close()

def pool(start, number):
    # each thread writes one contiguous batch of files
    for i in range(start, start + number):
        random_files(i)

if __name__ == "__main__":
    start_time = time.time()
    num_files = 50000
    threads = 8
    batch_size = num_files // threads  # integer division: 6,250 files per thread
    thread_list = [Thread(name='random_files', target=pool,
                          args=(batch_size * thread_index, batch_size))
                   for thread_index in range(threads)]
    for t in thread_list:
        t.start()
    for t in thread_list:
        t.join()  # wait for every thread to finish before stopping the timer
    print("--- %s seconds ---" % (time.time() - start_time))
The solution provided here, however, is only an example to show the speed increase that can be achieved. The method of splitting the files into batches only works because 50,000 files divide evenly into 8 batches (one per thread); a more robust pool() function would be needed to split an uneven load into batches.
Take a look at this SO example of splitting an uneven load across threads.
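As a minimal sketch of that idea (the make_batches() helper below is an illustrative name, not from the linked answer), you can use divmod() to spread any remainder over the first few threads and plug the resulting batches into the same pool() function as above:

from threading import Thread

def make_batches(num_files, threads):
    # base batch size, plus how many threads must take one extra file
    base, extra = divmod(num_files, threads)
    batches = []
    start = 0
    for t in range(threads):
        size = base + (1 if t < extra else 0)  # first `extra` threads get one more file
        batches.append((start, size))
        start += size
    return batches

# e.g. 50,001 files over 8 threads: one batch of 6,251 files and seven of 6,250
thread_list = [Thread(target=pool, args=batch) for batch in make_batches(50001, 8)]

The standard library's concurrent.futures.ThreadPoolExecutor is another option: its map() method distributes the work items across a fixed number of threads without any manual batching.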
Hope this helps!
Upvotes: 1