Prabhakar Shanmugam

Reputation: 6144

How to write large numbers of small files faster with Python threading

I would like to create about 50,000 files using Python. They are very simple files, each with fewer than 20 lines in it.

I first tried adding threading just for the sake of it, and it took 220 seconds on my 8th-gen i7 machine.

WITH THREADING


import threading
import time

path = "."  # base output directory; assumed here, as the original post never defines `path`

def random_files(i):
    filepath = path + "/content/%s.html" % (str(i))
    fileobj = open(filepath, "w+")
    l1 = "---\n"
    l2 = 'title: "test"\n'
    l3 = "date: 2019-05-01T18:37:07+05:30\n"
    l4 = "draft: false\n"
    l5 = 'type: "statecity"\n'
    l6 = "---\n"
    data = l1 + l2 + l3 + l4 + l5 + l6
    fileobj.write(data)
    fileobj.close()

if __name__ == "__main__":
    start_time = time.time()
    # one thread per file: 50,000 threads in total
    for i in range(0, 50000):
        threading.Thread(name='random_files', target=random_files, args=(i,)).start()
    print("--- %s seconds ---" % (time.time() - start_time))

WITHOUT THREADING

Going the non-threaded route takes 55 seconds.

import time

path = "."  # base output directory; assumed here, as the original post never defines `path`

def random_files():
    for i in range(0, 50000):
        filepath = path + "/content/%s.html" % (str(i))
        fileobj = open(filepath, "w+")
        l1 = "---\n"
        l2 = 'title: "test"\n'
        l3 = "date: 2019-05-01T18:37:07+05:30\n"
        l4 = "draft: false\n"
        l5 = 'type: "statecity"\n'
        l6 = "---\n"
        data = l1 + l2 + l3 + l4 + l5 + l6
        fileobj.write(data)
        fileobj.close()

if __name__ == "__main__":
    start_time = time.time()
    random_files()
    print("--- %s seconds ---" % (time.time() - start_time))

CPU usage is about 10% for the Python process, RAM usage is a meager 50 MB, and disk usage averages 4.5 MB/s.

Can the speed be increased drastically?

Upvotes: 1

Views: 1038

Answers (1)

Alessi 42

Reputation: 1162

Try threading with the load split equally across the threads on your system.

This provides an almost linear speed-up with the number of threads the load is split across:

Without Threading:

~11% CPU, ~5 MB/s disk

--- 69.15089249610901 seconds ---


With Threading: 4 Threads

22% CPU, 13 MB/s disk

--- 29.21335482597351 seconds ---


With Threading: 8 Threads

27% CPU, 15 MB/s disk

--- 20.8521249294281 seconds ---


For example:

import time
from threading import Thread

path = "."  # base output directory; assumed here, as the original post never defines `path`

def random_files(i):
    filepath = path + "/content/%s.html" % (str(i))
    fileobj = open(filepath, "w+")
    l1 = "---\n"
    l2 = 'title: "test"\n'
    l3 = "date: 2019-05-01T18:37:07+05:30\n"
    l4 = "draft: false\n"
    l5 = 'type: "statecity"\n'
    l6 = "---\n"
    data = l1 + l2 + l3 + l4 + l5 + l6
    fileobj.write(data)
    fileobj.close()

def pool(start, number):
    # each thread writes one contiguous batch of files
    for i in range(start, start + number):
        random_files(i)

if __name__ == "__main__":
    start_time = time.time()
    num_files = 50000
    threads = 8
    batch_size = num_files // threads  # integer division; assumes the files divide evenly
    thread_list = [Thread(name='random_files', target=pool,
                          args=(batch_size * thread_index, batch_size))
                   for thread_index in range(threads)]
    for t in thread_list:
        t.start()
    for t in thread_list:
        # wait for each of the threads to finish before stopping the timer
        t.join()

    print("--- %s seconds ---" % (time.time() - start_time))

The solution provided here, however, is only an example to show the achievable speed increase. Splitting the files into batches this way only works because 50,000 files divide evenly into 8 batches (one per thread); for an uneven load, a more robust pool() function is needed to split the work into batches, as in the sketch below.
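For instance, here is a minimal sketch of one way to handle a file count that does not divide evenly. The split_batches helper is hypothetical (my own name, not from the code above) and reuses the pool() function from the example:

from threading import Thread

def split_batches(total, threads):
    # divmod spreads the remainder over the first few threads,
    # e.g. 50,001 files over 8 threads -> batches of 6251, 6250, 6250, ...
    base, remainder = divmod(total, threads)
    batches, start = [], 0
    for t in range(threads):
        size = base + (1 if t < remainder else 0)
        batches.append((start, size))
        start += size
    return batches

thread_list = [Thread(target=pool, args=(start, size))
               for start, size in split_batches(50001, 8)]
for t in thread_list:
    t.start()
for t in thread_list:
    t.join()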

Try taking a look at this SO example of splitting an uneven load across threads.
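Alternatively (a suggestion of mine, not from the linked answer), the standard library's concurrent.futures.ThreadPoolExecutor handles the batching for you and works for any number of files. A sketch, reusing random_files() from above:

import time
from concurrent.futures import ThreadPoolExecutor

if __name__ == "__main__":
    start_time = time.time()
    # the executor queues all 50,000 tasks and feeds them to 8 worker threads;
    # leaving the with-block waits for every task to finish
    with ThreadPoolExecutor(max_workers=8) as executor:
        executor.map(random_files, range(50000))
    print("--- %s seconds ---" % (time.time() - start_time))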

Hope this helps!

Upvotes: 1
