tilanka kulatunge

Reputation: 77

Use multithreading for multiple file copies

I have to copy a large number of files (about 10,000), and it takes a long time.

I tried using two threads instead of a single thread: one to copy the odd-numbered files in the list and the other to copy the even-numbered ones.

I have used this code:

ThreadPool.QueueUserWorkItem(new WaitCallback(this.RunFileCopy), state); // state: the sublist of files for this worker

but there is no significant difference in time between using a single thread and using two threads.

What could be the reason for this?

Upvotes: 4

Views: 7113

Answers (2)

qwerty

Reputation: 2085

File copying is not a CPU-bound process; it is an I/O-bound process, so multithreading or parallelism won't help you.

Multithreading will slow you down in almost all cases. Even if the disk is an SSD, it has a limited read/write speed, and a single thread can use that bandwidth efficiently. If you use parallelism, you just split that speed into pieces, and on an HDD the extra seeking creates a huge overhead.

Multithreading only helps when more than one disk is involved: when you read from one set of disks and write to a different set.

If the files are small, zipping them and unzipping on the target drive can be faster in most cases, and if you zip the files with a low compression level it will be faster still.

using System.IO;
using System.IO.Compression;

.....

string startPath = @"c:\example\start";     // folder to copy
string zipPath = @"c:\example\result.zip";  // temporary archive
string extractPath = @"c:\example\extract"; // destination folder

// CompressionLevel.Fastest trades compression ratio for speed;
// the final 'true' includes the base directory in the archive.
ZipFile.CreateFromDirectory(startPath, zipPath, CompressionLevel.Fastest, true);

ZipFile.ExtractToDirectory(zipPath, extractPath);

More implementation details here

How to: Compress and Extract Files

Upvotes: 9

Ira Baxter

Reputation: 95420

I'm going to provide a minority opinion here. Everybody is telling you that Disk I/O is preventing you from getting any speedup from multiple threads. That's ... sort ... of right, but...

Given a single disk request, the OS can only move the heads to the point on the disk implicitly selected by the file access, usually incurring an average of half the full-stroke seek time (tens of milliseconds) plus rotational delay (another ~10 milliseconds) to reach the data. Sticking with single disk requests, this is a pretty horrendous (and unavoidable) cost to pay.

Because disk accesses take a long time, the OS has plenty of CPU to consider the best order to access the disk when there are multiple requests, should they occur while it is already waiting for the disk to do something. The OS does so usually with an elevator algorithm, causing the heads to efficiently scan across the disk in one direction in one pass, and scan efficiently in the other direction when the "furthest" access has been reached.

The idea is simple: if you process multiple disk requests in exactly the time order in which they arrive, the disk heads will likely jump randomly about the disk (under the assumption the files are placed randomly), thus incurring the half-full-stroke seek plus rotational delay on every access. With 1000 live accesses processed in arrival order, 1000 average half-full seeks will occur. Ick.

Instead, given N near-simultaneous accesses, the OS can sort them by the physical cylinder they will touch and then process them in cylinder order. 1000 live accesses, processed in cylinder order (even with random file placement), are likely to average about one request per cylinder. Now the heads only have to step from one cylinder to the next, and that's a lot less than the average seek.

So having lots of requests should help the OS make better access-order decisions.
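A toy calculation makes the point concrete. The sketch below (my illustration, not part of OP's code; the cylinder counts are made up) compares total head travel when random requests are serviced in arrival order versus in one sorted elevator sweep:

```csharp
using System;
using System.Linq;

class Elevator
{
    // Total head travel, in cylinders, to service requests in the given order,
    // starting from cylinder 0.
    public static int Travel(int[] cylinders)
    {
        int pos = 0, total = 0;
        foreach (int c in cylinders)
        {
            total += Math.Abs(c - pos);
            pos = c;
        }
        return total;
    }

    static void Main()
    {
        var rng = new Random(1);
        // 1000 pending requests scattered over a hypothetical 10,000-cylinder disk.
        int[] arrival = Enumerable.Range(0, 1000).Select(_ => rng.Next(10000)).ToArray();
        int[] sweep = arrival.OrderBy(c => c).ToArray(); // one elevator pass

        Console.WriteLine($"arrival order:  {Travel(arrival)} cylinders of travel");
        Console.WriteLine($"elevator sweep: {Travel(sweep)} cylinders of travel");
        // The sweep's travel is bounded by the highest requested cylinder,
        // while arrival order travels millions of cylinders in total.
    }
}
```

The single sweep never travels farther than the highest requested cylinder, which is why a deep request queue is worth so much on a spinning disk.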

Since OP has lots of files, there's no reason he could not run a lot of threads, each copying its own file and generating demand for disk locations. He would want each thread to issue a read or write of something like a full track, so that when the heads arrive at a cylinder, a full track is read or written (under the assumption the OS lays files out contiguously on a track where it can).

OP would want to make sure his machine had enough RAM to buffer his thread count times the track size. An 8 GB machine with 4 GB unbusy during the copy effectively has a 4 GB disk cache. At 100 KB per track (it's been a long time since I looked), that suggests "room" for 10,000 threads. I seriously doubt he needs that many; mostly he needs enough threads to overwhelm the number of cylinders on his disk. I'd surely consider several hundred threads.

Two threads surely is not enough. (Windows appears to use one thread when you ask it to copy a bunch of files. That has always seemed pretty dumb to me.)
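A minimal sketch of the many-threads idea (my illustration, not OP's code: the class and method names are made up, and a shared queue of source/destination pairs is assumed) would be:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public class ParallelCopy
{
    // Copy each (source, destination) pair using many concurrent workers,
    // so the OS always has a deep queue of disk requests it can reorder.
    public static void CopyAll(IEnumerable<(string src, string dst)> files, int workers)
    {
        var queue = new ConcurrentQueue<(string src, string dst)>(files);
        var tasks = new Task[workers];
        for (int i = 0; i < workers; i++)
        {
            tasks[i] = Task.Run(() =>
            {
                // Each worker pulls the next pending file and copies it whole.
                while (queue.TryDequeue(out var pair))
                    File.Copy(pair.src, pair.dst, overwrite: true);
            });
        }
        Task.WaitAll(tasks);
    }
}
```

With a few hundred workers, the pending reads and writes give the elevator algorithm plenty to sort; on an SSD the same code simply degrades to the single-thread case.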

Another poster suggested zipping the files. With lots of threads all waiting on the disk (the elevator algorithm doesn't change that, just the average wait time), many threads can afford to throw computational cycles at zipping. This won't help with reads; files are what they are when read. But it may shorten the amount of data to write, and it provides effectively larger buffers in memory, giving some additional speedup.

Note: If one has an SSD, then there are no physical cylinders, thus no seek time, and nothing for an elevator algorithm to optimize. Here, lots of threads don't buy any cylinder ordering time. They shouldn't hurt, either.

Upvotes: 1
