Matt

Reputation: 1632

Does multiprocessing speed up file transfers compared to multithreading?

I am writing a script to simultaneously accept many file transfers from many computers on a subnet using sockets (around 40 jpg files total). I want to use multithreading or multiprocessing to make the transfer occur as fast as possible.

I'm wondering if this type of image transfer is limited by the CPU - and therefore I should use multiprocessing - or if multithreading will be just as good here.

I would also be curious as to what types of activities are limited by the CPU and require multiprocessing, and which are better suited for multithreading.

Upvotes: 0

Views: 1697

Answers (3)

Timmy_A

Reputation: 1232

Unless your file transfer is extremely slow (slower than writing data to disk), multithreading/multiprocessing isn't going to help. By file transfer I mean downloading images and writing them to a local computer with a single HDD.

Using multithreading or multiprocessing when transferring data from several computers with separate disks can definitely improve overall download performance, simply because data on several physical disks can be read in parallel. The problem arises when you try to save these images to your local drive.

You have just a single local HDD (unless you use a disk array), and a single HDD, like most hardware devices, can perform only one I/O operation at a time. So trying to write several images to disk at the same time won't improve overall performance; it can even hamper it.

Just imagine 40 already-downloaded images being written to a single mechanical HDD with a single head, each to a different location (different physical files), especially if the disk is fragmented. This can even slow down the whole process, because the HDD wastes time moving its magnetic head from one position to another (drives can partially mitigate this by reordering I/O operations to limit head movement).
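One way to keep the receives parallel while avoiding concurrent writes is to funnel everything through a single writer thread. The following is only a minimal sketch of that idea in Python, not something from the question: `receive_image` is a hypothetical stand-in for a real socket receive loop, and the names `disk_writer` and `transfer_all` are made up for illustration.

```python
import queue
import threading
from pathlib import Path

_DONE = None  # sentinel telling the writer thread to stop


def disk_writer(jobs, out_dir):
    """Write (filename, data) pairs to out_dir strictly one at a time."""
    while True:
        job = jobs.get()
        if job is _DONE:
            break
        name, data = job
        (Path(out_dir) / name).write_bytes(data)


def receive_image(jobs, name, payload):
    """Hypothetical stand-in for a socket receive loop (assumption)."""
    jobs.put((name, payload))


def transfer_all(images, out_dir):
    """images: dict mapping filename -> bytes, standing in for remote sources."""
    jobs = queue.Queue()
    writer = threading.Thread(target=disk_writer, args=(jobs, out_dir))
    writer.start()
    receivers = [
        threading.Thread(target=receive_image, args=(jobs, name, data))
        for name, data in images.items()
    ]
    for t in receivers:
        t.start()
    for t in receivers:
        t.join()
    jobs.put(_DONE)  # all receivers finished; let the writer drain and exit
    writer.join()
```

Whether this actually beats 40 threads writing directly depends on the disk and the OS cache, so it is worth measuring rather than assuming.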

On the other hand, if you do some CPU-intensive preprocessing on these images before saving them to disk, multithreading can be really helpful.

As for which is preferred: on modern OSs there is no significant difference between using multithreading and multiprocessing (spawning multiple processes). OSs like Linux or Windows schedule threads, not processes, based on process and thread priorities. So there is not a big difference between 40 single-threaded processes and a single process containing 40 threads. Using multiple processes normally consumes more memory, because the OS has to allocate some extra memory for every process (not much), but in terms of speed the difference between multithreading and multiprocessing is not significant. There are other important questions to consider when choosing a method: will these downloads share some data, like a common GUI (multithreading is easier to use)? Are the files so big that 40 transfers can exhaust the virtual address space of a single process (use multiprocessing)?

Generally:

Multithreading - easier to use in a single application, because all threads share the virtual address space of a single process and can easily communicate with each other. On the other hand, a single process has a limited virtual address space (less than 4GB on a 32-bit computer).

Multiprocessing - harder to use in a single application (it needs inter-process communication), but more scalable and more robust (if a file transfer process crashes, only that one file transfer fails), plus more virtual address space to use.

Upvotes: 0

Jo Shinhaeng

Reputation: 36

Short answer: Generally, it really depends on your workload. If you're serious about performance, please provide details, for example whether you store the images to disk, whether image sizes are > 1GB, etc.

Note: Again, generally, if it is not mission-critical, both ways are acceptable, since we can easily switch between multithreaded and multiprocess implementations using threading.Thread and multiprocessing.Process.
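To illustrate how small the switch can be, here is a rough sketch that takes the worker class as a parameter. It relies only on the fact that threading.Thread and multiprocessing.Process both accept `target`/`args` and provide `start()` and `join()`. The `save_file` worker and the `run_workers` helper are hypothetical names, and the worker just writes bytes to a path instead of receiving them from a socket.

```python
import multiprocessing
import threading
from pathlib import Path


def save_file(path, data):
    """Hypothetical worker: stands in for 'receive one file and store it'."""
    Path(path).write_bytes(data)


def run_workers(worker_cls, jobs):
    """Run one worker per (path, data) job and wait for all of them.

    worker_cls is threading.Thread or multiprocessing.Process.
    """
    workers = [worker_cls(target=save_file, args=job) for job in jobs]
    for w in workers:
        w.start()
    for w in workers:
        w.join()


# Usage (same call, different worker class):
#   run_workers(threading.Thread, jobs)
#   run_workers(multiprocessing.Process, jobs)
# With processes, make the call under `if __name__ == "__main__":`,
# because some platforms start children by re-importing the main module.
```

The worker has to be a module-level function so that it can be pickled when processes are used.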

Some more comments:

It seems that I/O, not the CPU, will be the bottleneck.

For multiprocess vs. multithread, we may see a performance difference due to the GIL and/or your implementation. You may implement it both ways and try. BTW, IMHO it won't differ much. I think that async I/O vs blocking I/O will have a greater impact.
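For comparison, an async I/O receiver could look roughly like the sketch below, which uses Python's asyncio streams so that one thread services all connections. Everything specific here is an assumption for illustration: the port number, the output file naming, and the function names `receive_stream`, `handle_client` and `serve`.

```python
import asyncio


async def receive_stream(reader, sink, chunk_size=64 * 1024):
    """Copy everything from an asyncio StreamReader into a binary file object.

    Returns the number of bytes received.
    """
    loop = asyncio.get_running_loop()
    total = 0
    while True:
        chunk = await reader.read(chunk_size)
        if not chunk:  # peer closed the connection
            break
        # File writes block, so hand them to the default thread pool.
        await loop.run_in_executor(None, sink.write, chunk)
        total += len(chunk)
    return total


async def handle_client(reader, writer):
    """Per-connection callback; the output file name is a made-up convention."""
    host, port = writer.get_extra_info("peername")[:2]
    with open(f"{host}_{port}.jpg", "wb") as sink:
        await receive_stream(reader, sink)
    writer.close()
    await writer.wait_closed()


async def serve(host="0.0.0.0", port=9000):
    """Accept connections until cancelled (host and port are assumptions)."""
    server = await asyncio.start_server(handle_client, host, port)
    async with server:
        await server.serve_forever()
```

Running it would be `asyncio.run(serve())`. Each sender is assumed to open a connection, send one file, and close the connection.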

Upvotes: 1

Jeremy Friesner

Reputation: 73181

If the following assumptions are true:

  1. Your script is simply receiving data from the network and writing that data to disk (more or less) verbatim, i.e. it isn't doing any expensive processing on the data
  2. Your script is running on a modern CPU with typical modern networking hardware (e.g. gigabit Ethernet or slower)
  3. Your script's download routines are not grossly inefficient (e.g. you are receiving reasonably-sized chunks of data and not just 1 byte at a time or something silly like that)

... then it's unlikely that your download rate will be CPU-limited. More likely the bottleneck will be either network bandwidth or disk I/O bandwidth.
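As an illustration of assumption 3, a receive loop that reads reasonably sized chunks might look like this in Python. It is only a sketch: the 64 KiB chunk size and the name `receive_to_file` are arbitrary choices, and it assumes the sender closes the connection once the file is complete.

```python
import socket
from typing import BinaryIO


def receive_to_file(conn: socket.socket, out_file: BinaryIO,
                    chunk_size: int = 64 * 1024) -> int:
    """Read from a connected socket until the peer closes it.

    Each chunk is written to out_file as it arrives.
    Returns the total number of bytes received.
    """
    total = 0
    while True:
        chunk = conn.recv(chunk_size)  # up to chunk_size bytes per call
        if not chunk:  # empty bytes means the peer closed the connection
            break
        out_file.write(chunk)
        total += len(chunk)
    return total
```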

In any case, since AFAICT your use-case is embarrassingly parallel (i.e. the various downloads never have to communicate or interact with each other, they just each do their own thing independently), it's unlikely that using multithreading vs multiprocessing will make much difference in terms of performance. Of course, the only way to be certain is to try it both ways and measure the throughput each way.
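A small timing helper is enough for that comparison. The sketch below is one possible way to do it in Python. `measure_throughput` is a made-up name, and `transfer` is assumed to be any zero-argument callable that performs the whole batch of downloads (for example your threaded variant, then your multiprocess variant).

```python
import time


def measure_throughput(transfer, total_bytes):
    """Run transfer() once and return (seconds, megabytes_per_second)."""
    start = time.perf_counter()
    transfer()
    # Guard against a zero reading for extremely short runs.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return elapsed, total_bytes / elapsed / 1e6
```

Run each variant several times and compare typical values, since a single run can be skewed by disk caching or other network traffic.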

Upvotes: 1
