Read a file using threads

Question

I'm trying to write a Python program that sends files from one PC to another using Python's sockets, but as the file size increases it takes a lot of time. Is it possible to read the lines of a file sequentially using threads?

The concept I have in mind is as follows: each thread separately and sequentially reads lines from the file and sends them over the socket. Is this possible? Or do you have any other suggestions?

abarnert · Accepted Answer

First, if you want to speed this up as much as possible without using threads: reading and sending a line at a time can be pretty slow. Python does a great job of buffering up the file to give you a line at a time for reading, but then you're sending tiny 72-byte packets over the network. You want to try to send at least 1.5KB (roughly one Ethernet frame's worth) at a time when possible.
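A minimal sketch of the block-at-a-time alternative, assuming a connected, blocking TCP socket. The names `send_file_in_blocks` and `BLOCK_SIZE` are made up for illustration, and 64 KiB is only a starting guess to measure against:

```python
import socket

BLOCK_SIZE = 64 * 1024  # assumed starting point; tune by measuring


def send_file_in_blocks(sock: socket.socket, path: str) -> int:
    """Send the file at `path` over `sock` in large binary blocks.

    Returns the number of bytes sent.
    """
    total = 0
    with open(path, "rb") as f:  # binary mode: no decoding, no line splitting
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:  # an empty bytes object means EOF
                break
            sock.sendall(block)  # sendall() keeps going until the block is sent
            total += len(block)
    return total
```

Each `sendall` call hands the OS a large buffer, so the network stack can fill its packets instead of shipping one short line at a time.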

Ideally, you want to use the socket's sendfile method. Python will tell the OS to send the whole file over the socket in whatever way is most efficient, without getting your code involved at all. Unfortunately, the fast path isn't available on Windows (there, socket.sendfile falls back to ordinary send calls); if you care about that, you may want to drop to the native APIs¹ directly with pywin32 or switch to a higher-level networking library like twisted or asyncio.
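A minimal sketch of that call, assuming Python 3.5 or later and a connected, blocking stream socket. The wrapper name `send_whole_file` is made up for illustration:

```python
import socket


def send_whole_file(sock: socket.socket, path: str) -> int:
    """Send the file at `path` with socket.sendfile(); returns bytes sent."""
    with open(path, "rb") as f:  # sendfile() needs a file opened in binary mode
        # socket.sendfile() uses os.sendfile() where the platform provides it
        # and falls back to plain send() calls otherwise. It does not support
        # non-blocking sockets.
        return sock.sendfile(f)
```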

Now, what about threading?

Well, reading a line at a time in different threads is not going to help very much. The threads have to read sequentially, fighting over the read pointer (and buffer) in the file object, and they presumably have to write to the socket sequentially, and you probably even need a mutex to make sure they write things in order. So, whichever one of those is slowest, all of your threads are going to end up waiting for their turn.²

Also, even forgetting about the sockets: Reading a file in parallel can be faster in some situations on modern hardware, but in general it's actually a lot slower. Imagine the file is on a slow magnetic hard drive. One thread is trying to read the first chunk, the next thread is trying to read the 64th chunk, the next thread is trying to read the 4th chunk… this means you spend more time seeking the disk head back and forth than actually reading data.

But, if you think you might be in one of those situations where parallel reads might help, you can try it. It's not trivial, but it's not that hard.

First, you want to do binary reads of fixed-size chunks. You're going to need to experiment with different sizes—maybe 4KB is fastest, maybe 1MB… so make sure to make it a constant you can easily change in just one place in the code.
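One way to try fixed-size binary reads in parallel might look like the sketch below. The names `read_chunk` and `read_all_chunks`, the 256 KiB chunk size, and the worker count are all assumptions for illustration. It also holds every chunk in memory, so it is only suitable for timing experiments on files that fit in RAM:

```python
import os
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 256 * 1024  # the one constant to experiment with


def read_chunk(path, offset):
    """Read up to CHUNK_SIZE bytes starting at `offset`."""
    # Each call opens its own file object, so threads never share a
    # read pointer or buffer.
    with open(path, "rb") as f:
        f.seek(offset)
        return offset, f.read(CHUNK_SIZE)


def read_all_chunks(path, workers=4):
    """Read every chunk of the file using a small thread pool."""
    size = os.path.getsize(path)
    offsets = range(0, size, CHUNK_SIZE)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in offset order even when the reads
        # themselves finish out of order.
        return list(pool.map(lambda off: read_chunk(path, off), offsets))
```

Timing `read_all_chunks` with different `CHUNK_SIZE` and `workers` values against a plain sequential read is a quick way to find out whether parallel reads help at all on a given disk.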

Next, you want to be able to send the data as soon as you can get it, rather than serializing. This means you have to send some kind of identifier, like the offset into the file, before each chunk.

The function will look something like this:

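import struct

CHUNK_SIZE = 64 * 1024  # placeholder value; tune by experiment
class OopsError(Exception): pass  # placeholder error type

def sendchunk(sock, lock, file, offset):
    with lock:  # keep each header and its chunk together on the wire
        sock.sendall(struct.pack('>Q', offset))  # 8-byte big-endian offset
        sent = sock.sendfile(file, offset, CHUNK_SIZE)
        if sent < CHUNK_SIZE:
            raise OopsError(f'Only sent {sent} out of {CHUNK_SIZE} bytes')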

… except that (unless your files actually are all multiples of CHUNK_SIZE) you need to decide what you want to do for a legitimate EOF. Maybe send the total file size before any of the chunks, and pad the last chunk with null bytes, and have the receiver truncate the last chunk.
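The size-then-padding idea could be sketched like this. It swaps `sendfile` for a plain read plus `sendall`, because the padding has to be added in Python. The function names, the 8-byte size preamble format, and the use of a single lock for both the file object and the socket are assumptions, not part of any standard protocol:

```python
import os
import struct

CHUNK_SIZE = 64 * 1024  # assumed value; sender and receiver must agree on it


def send_size_preamble(sock, path):
    """Send the real file size as 8 big-endian bytes, before any chunk."""
    size = os.path.getsize(path)
    sock.sendall(struct.pack(">Q", size))
    return size


def send_padded_chunk(sock, lock, file, offset):
    """Send one offset-tagged chunk, null-padding a short final chunk.

    Returns the number of real (unpadded) bytes in the chunk.
    """
    with lock:  # one lock guards both the shared file object and the socket
        file.seek(offset)
        data = file.read(CHUNK_SIZE)
        if not data:
            raise ValueError(f"offset {offset} is at or past end of file")
        # Every frame on the wire is exactly 8 + CHUNK_SIZE bytes long.
        frame = struct.pack(">Q", offset) + data.ljust(CHUNK_SIZE, b"\0")
        sock.sendall(frame)
    return len(data)
```

Because every frame has the same length, the receiver never has to guess where one chunk ends and the next begins; the size preamble tells it how much of the last chunk is real data.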

The receiving side can then just loop reading 8+CHUNK_SIZE bytes, unpacking the offset, seeking, and writing the bytes.
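A rough sketch of such a receiving loop, assuming the sender first transmits the total file size as 8 big-endian bytes and then sends fixed-size, null-padded chunks, each prefixed with its 8-byte offset. The names `recv_exactly` and `receive_file` are hypothetical, and the sketch trusts the sender: it does not check for duplicate or out-of-range offsets:

```python
import struct

CHUNK_SIZE = 64 * 1024  # must match the sender
HEADER = struct.Struct(">Q")  # 8-byte big-endian unsigned integer


def recv_exactly(sock, n):
    """Read exactly n bytes, or raise if the peer closes early."""
    buf = bytearray()
    while len(buf) < n:
        part = sock.recv(n - len(buf))  # recv() may return fewer bytes than asked
        if not part:
            raise ConnectionError(f"closed after {len(buf)} of {n} bytes")
        buf.extend(part)
    return bytes(buf)


def receive_file(sock, path):
    """Receive a size preamble, then offset-tagged chunks, into `path`."""
    (size,) = HEADER.unpack(recv_exactly(sock, HEADER.size))
    chunks_expected = -(-size // CHUNK_SIZE)  # ceiling division
    with open(path, "wb") as out:
        for _ in range(chunks_expected):
            frame = recv_exactly(sock, HEADER.size + CHUNK_SIZE)
            (offset,) = HEADER.unpack_from(frame)
            out.seek(offset)  # chunks may arrive in any order
            out.write(frame[HEADER.size:])
        out.truncate(size)  # drop the null padding of the final chunk
    return size
```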

1. See TransmitFile. To use it, though, you have to know how to go between Python-level socket objects and Win32-level HANDLEs, and so on; if you've never done that, there's a learning curve, and I don't know of a good tutorial to get you started.

2. If you're really lucky, and, say, the file reads are only twice as fast as the socket writes, pipelining might actually cut the total time by about a third: if a read takes 1 unit and a write takes 2, a serial loop spends 3 units per chunk, while a pipelined one spends about 2. That is, only one thread can be writing at a time, but the threads waiting to write have mostly already done their reading, so at least you don't need to wait there.
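The pipelining described in footnote 2 can be sketched with a single reader thread feeding a bounded queue while the calling thread sends. The names `pipelined_send` and `BLOCK_SIZE` and the queue depth are assumptions, and error handling is kept to the bare minimum needed to avoid a hang:

```python
import queue
import threading

BLOCK_SIZE = 64 * 1024  # assumed value; tune by measuring


def pipelined_send(sock, path, depth=4):
    """Read blocks in a helper thread while this thread sends them."""
    blocks = queue.Queue(maxsize=depth)  # bounded, so reads can't run far ahead

    def reader():
        try:
            with open(path, "rb") as f:
                while True:
                    block = f.read(BLOCK_SIZE)
                    if not block:
                        break
                    blocks.put(block)
        finally:
            blocks.put(b"")  # empty block is the end marker, even on error

    t = threading.Thread(target=reader, daemon=True)
    t.start()
    total = 0
    while True:
        block = blocks.get()
        if not block:
            break
        sock.sendall(block)  # the next read overlaps with this send
        total += len(block)
    t.join()
    return total
```

The sends still happen strictly in order on one thread, so no offsets or locks are needed; the only gain is that reading the next block overlaps with sending the current one.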
