Reputation: 23
How can I implement multithreading to make this process faster? The program generates 1 million random numbers and writes them to a file. It takes just over 2 seconds, but I'm wondering if multithreading would make it any faster.
import random
import time
startTime = time.time()
data = open("file2.txt", "a+")
for i in range(1000000):
    number = str(random.randint(1, 9999))
    data.write(number + '\n')
data.close()
executionTime = (time.time() - startTime)
print('Execution time in seconds: ' + str(executionTime))
Upvotes: 2
Views: 100
Reputation:
Write the string to the file all at once rather than writing each number individually. I have also tested this with multithreading, and it actually degrades performance: since you are writing to the same file, threading forces you to synchronize the writes, which hurts performance.
import time
startTime = time.time()
import random
size = 1000_000
# pre-declaring the list to save the cost of resizing it later
l = [None] * size
# l = list(str(random.randint(1, 99999)) for i in range(size))
for i in range(size):
    # l[i] = str(random.randint(1, 99999)) + "\n"
    l[i] = f"{random.randint(1, 99999)}\n"
# writing the data to the file all at once
with open('file2.txt', 'w+') as data:
    data.write(''.join(l))
executionTime = (time.time() - startTime)
print('Execution time in seconds: ' + str(executionTime))
Execution time in seconds: 1.943817138671875
Upvotes: 0
Reputation: 52337
The short answer: Not easily.
Here is an example of using a multiprocessing pool to speed up your code:
import random
import time
from multiprocessing import Pool
startTime = time.time()
def f(_):
    number = str(random.randint(1, 9999))
    data.write(number + '\n')

data = open("file2.txt", "a+")
with Pool() as p:
    p.map(f, range(1000000))
data.close()
executionTime = (time.time() - startTime)
print(f'Execution time in seconds: {executionTime}')
Looks good? Wait! This is not a drop-in replacement: it lacks synchronization between the processes, so not all 1000000 lines will be written (some are overwritten in the same buffer)! See Python multiprocessing safely writing to a file
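As a quick sanity check (a minimal sketch, assuming the file2.txt produced by the snippet above), you can count how many lines actually made it to disk after a run:
# hypothetical check: count the lines written by the unsynchronized version above
with open("file2.txt") as check:
    written = sum(1 for _ in check)
print(written)  # typically comes out below 1000000 when the writes race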
So we need to separate computing the numbers (in parallel) from writing them (in serial). We can do this as follows:
import random
import time
from multiprocessing import Pool
startTime = time.time()
def f(_):
    return str(random.randint(1, 9999))

with Pool() as p:
    output = p.map(f, range(1000000))

with open("file2.txt", "a+") as data:
    data.write('\n'.join(output) + '\n')

executionTime = (time.time() - startTime)
print(f'Execution time in seconds: {executionTime}')
With that fixed, note that this is not multithreading but uses multiple processes. You can change it to multithreading with a different pool object:
from multiprocessing.pool import ThreadPool as Pool
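For reference, here is a minimal sketch of the thread-based variant; it is the same logic as the process-based version above, only the pool class differs:
import random
import time
from multiprocessing.pool import ThreadPool as Pool  # threads instead of processes

startTime = time.time()

def f(_):
    return str(random.randint(1, 9999))

# same structure as the process-based version above
with Pool() as p:
    output = p.map(f, range(1000000))

with open("file2.txt", "a+") as data:
    data.write('\n'.join(output) + '\n')

executionTime = (time.time() - startTime)
print(f'Execution time in seconds: {executionTime}')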
On my system, the runtime drops from about 1 second to 0.35 seconds with the process pool. With the ThreadPool, however, it takes up to 2 seconds!
The reason is that Python's global interpreter lock prevents multiple threads from processing your task efficiently, see What is the global interpreter lock (GIL) in CPython?
To conclude: multithreading is not always the right answer.
The upside: with multiprocessing instead of multithreading, I did get a 3x speedup on my system.
Upvotes: 2