Reputation: 194
I'm writing a multithreaded decompressor in python. Each thread needs to access a different chunk of the input file.
Note 1: it's not possible to load the whole file, as it ranges from 15 Gb to 200 Gb; I'm not using multithreading to speed up data read, but data decompression, I just want to make sure data read does not slow down decompression.
Note 2: the GIL is not a problem, here, as the main decompressor function is a C extension and it calls Py_ALLOW_THREADS, so that the GIL is released while decompressing. The second stage decompression uses numpy which is also GIL-free.
1) I assumed it would NOT work to simply share a Decompressor object (which basically wraps a file object), since if thread A calls the following:
decompressor.seek(x)
decompressor.read(1024)
and thread B does the same, thread A might end up reading from thread B offset. Is this correct?
2) Right now I'm simply making every thread create its own Decompressor instance and it seems to work, but I'm not sure it is the best approach. I considered these possibilities:
Add something like
seekandread(from_where, length)
to the Decompressor class which acquires a lock, seeks, reads and releases the lock;
Create a thread which waits for read requests and executes them in the correct order.
So, am I missing an obvious solution? Is there a significant performance difference between these methods?
Thanks
Upvotes: 7
Views: 7078
Reputation: 21318
You could use mmap. See mmap() vs. reading blocks
As Tim Cooper notes, mmap is a good idea when you have random access (multiple threads would make it seem like you have this), and they would be able to share the same physical pages.
Upvotes: 2
Reputation: 414905
CPython has GIL so multiple threads do not increase performance for CPU-bound tasks.
If the problem is not IO-bound (disk provides/stores data faster than CPU decompresses it) you could use multiprocessing module: each process opens the file and decompresses a given byte range.
Upvotes: 1
Reputation: 19201
You might want to use the Leader/Follower pattern if you are not doing that already.
The Leader thread will know which segments are already being handled and which are not and will assign itself with the next unprocessed segment and then become a follower, leaving the leadership to the next available thread in the pool.
Upvotes: 2