Reputation: 379
I have a list that contains multiple dataframes. These dataframes can be quite large and take some time to write to CSV files. I am trying to write them to CSV files concurrently using pandas, and tried to use multithreading to reduce the time. Why is the multithreading version taking more time than the sequential version? Is writing a file to CSV with pandas not an I/O-bound process, or am I not implementing it correctly?
Multithreading:
list_of_dfs = [df_a, df_b, df_c]
start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    results = executor.map(lambda x: list_of_dfs[x].to_csv('Rough/' + str(x) + '.csv', index=False), range(0, 3))
print(time.time() - start)
>>> 18.202364921569824
Sequential:
start = time.time()
for i in range(0, 3):
    list_of_dfs[i].to_csv('Rough/' + str(i) + '.csv', index=False)
print(time.time() - start)
>>> 13.783314228057861
Upvotes: 2
Views: 1132
Reputation: 8945
Also: "the ruling constraint" in this situation really is the hardware. If several different threads or processes are all trying to write data at the same time, you can inadvertently "put your poor, hard-working, public-servant disk drive into a monkey dance." It's probably better to write them one at a time, so that the disk's read/write head mechanism doesn't have to move around so much.
Upvotes: 1
Reputation: 50278
I assume you are using the standard CPython interpreter.
Why is the multithreading version taking more time than the sequential version?
The answer probably lies in the CPython Global Interpreter Lock (GIL).
Indeed, Pandas uses the internal csv library of CPython to write CSV files. However, AFAIK, the csv library (written in C) reads basic Python objects from memory (so it is not aware of NumPy) and formats them into strings so that they can be written to your storage device. Access to CPython objects is protected by the GIL, which prevents any speed-up (assuming most of the time is spent accessing CPython objects).
Is writing a file to csv with pandas not an IO Bound Process or am I not implementing it correctly?
Writing CSV files using Pandas is clearly not I/O-bound on modern machines (with any decent SSD). The formatting process is very slow (integer and float conversions as well as string handling) and should take most of the time. Moreover, it is interleaved with slow CPython object accesses. This explains why you do not get any speed-up.
Context switches between two threads often result in lower performance. You can find more information about this in the CPython documentation itself:
The GIL can degrade performance even when it is not a bottleneck. Summarizing the linked slides: The system call overhead is significant, especially on multicore hardware. Two threads calling a function may take twice as much time as a single thread calling the function twice. The GIL can cause I/O-bound threads to be scheduled ahead of CPU-bound threads, and it prevents signals from being delivered.
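The claim that "two threads calling a function may take twice as much time as a single thread calling the function twice" is easy to reproduce with any pure-Python CPU-bound function; a toy sketch (not pandas-specific, and the exact timings are machine-dependent):

```python
import threading
import time

def count(n):
    # Pure-Python busy loop: the thread holds the GIL while it runs.
    while n:
        n -= 1

N = 5_000_000

# Sequential: one thread calls the function twice.
start = time.time()
count(N)
count(N)
sequential = time.time() - start

# Threaded: two threads call the function once each.
start = time.time()
threads = [threading.Thread(target=count, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.time() - start

# With the GIL, the threaded version is typically no faster than the
# sequential one, and context-switch overhead can make it slower.
print(f'sequential: {sequential:.2f}s, threaded: {threaded:.2f}s')
```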
Upvotes: 1