ring0

Reputation: 811

Improve write speed for high speed file copy?

I've been trying to find the fastest way to code a file copy routine that copies a large file onto RAID 5 hardware.

The average file size is around 2 GB.

There are 2 Windows boxes (both running Win2k3). The first box is the source, where the large file is located. The second box has the RAID 5 storage.

http://blogs.technet.com/askperf/archive/2007/05/08/slow-large-file-copy-issues.aspx

The above link clearly explains why Windows copy, robocopy and other common copy utilities suffer in write performance. Hence, I've written a C/C++ program that uses the CreateFile, ReadFile & WriteFile APIs with the NO_BUFFERING & WRITE_THROUGH flags. The program simulates ESEUTIL.exe in the sense that it uses 2 threads, one for reading and one for writing. The reader thread reads 256 KB from the source and fills a buffer. Once 16 such 256 KB blocks are filled, the writer thread writes the contents of the buffer to the destination file. As you can see, the writer thread writes 8MB of data in 1 shot. The program allocates 32 such 8MB blocks, so the writing and reading can happen in parallel. Details of ESEUtil.exe can be found in the above link. Note: I am taking care of the data alignment issues when using NO_BUFFERING.
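
For reference, the reader/writer arrangement described above can be sketched in portable C++ roughly as follows. This is not the actual program: std::istream/std::ostream stand in for the CreateFile/ReadFile/WriteFile handles, there is no NO_BUFFERING alignment handling, and BlockQueue, pipelined_copy, block_size and max_blocks are names made up for the sketch.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <istream>
#include <mutex>
#include <ostream>
#include <queue>
#include <sstream>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Bounded queue of filled blocks shared by the reader and writer threads.
class BlockQueue {
public:
    explicit BlockQueue(std::size_t capacity) : capacity_(capacity) {}

    // Blocks while the queue is full, so the reader cannot run far ahead.
    void push(std::vector<char> block) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [&] { return blocks_.size() < capacity_; });
        blocks_.push(std::move(block));
        not_empty_.notify_one();
    }

    // Returns false once the queue has been closed and fully drained.
    bool pop(std::vector<char>& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [&] { return !blocks_.empty() || closed_; });
        if (blocks_.empty()) return false;
        out = std::move(blocks_.front());
        blocks_.pop();
        not_full_.notify_one();
        return true;
    }

    // Called by the reader when the source is exhausted.
    void close() {
        std::lock_guard<std::mutex> lock(mutex_);
        closed_ = true;
        not_empty_.notify_all();
    }

private:
    std::size_t capacity_;
    bool closed_ = false;
    std::mutex mutex_;
    std::condition_variable not_full_, not_empty_;
    std::queue<std::vector<char>> blocks_;
};

// Copies src to dst with one reader thread and the calling thread as writer.
// Returns the number of bytes written.
std::size_t pipelined_copy(std::istream& src, std::ostream& dst,
                           std::size_t block_size, std::size_t max_blocks) {
    assert(block_size > 0 && max_blocks > 0);
    BlockQueue queue(max_blocks);

    std::thread reader([&] {
        for (;;) {
            std::vector<char> block(block_size);
            src.read(block.data(), static_cast<std::streamsize>(block.size()));
            const std::size_t got = static_cast<std::size_t>(src.gcount());
            if (got == 0) break;   // end of input (or read error)
            block.resize(got);     // the last block may be short
            queue.push(std::move(block));
        }
        queue.close();
    });

    std::size_t total = 0;
    std::vector<char> block;
    while (queue.pop(block)) {
        dst.write(block.data(), static_cast<std::streamsize>(block.size()));
        total += block.size();
    }
    reader.join();
    return total;
}
```

With file streams opened in binary mode, a call such as pipelined_copy(in, out, 8 * 1024 * 1024, 32) would keep up to 32 blocks of 8 MB in flight between the two threads.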

I used benchmarking utilities like ATTO and found that our RAID 5 hardware has a write speed of 44 MB per second when writing 8 MB data chunks, which is around 2.57 GB per minute.

But my program is able to achieve only 1.4 GB per minute.
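
To compare the two figures in the same units, here is a small sketch of the conversion. It assumes 1 GB = 1024 MB, which is what makes 44 MB per second come out at about 2.57 GB per minute; the function names are only for illustration.

```cpp
#include <cassert>

// Assumption: 1 GB = 1024 MB.
const double kMBPerGB = 1024.0;

// MB per second -> GB per minute.
inline double mb_per_sec_to_gb_per_min(double mb_per_sec) {
    assert(mb_per_sec >= 0.0);
    return mb_per_sec * 60.0 / kMBPerGB;
}

// GB per minute -> MB per second.
inline double gb_per_min_to_mb_per_sec(double gb_per_min) {
    assert(gb_per_min >= 0.0);
    return gb_per_min * kMBPerGB / 60.0;
}
```

By this arithmetic 44 MB per second is about 2.58 GB per minute, and 1.4 GB per minute is about 23.9 MB per second, so the copy runs at a bit over half of the benchmarked write speed.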

Can anyone please help me identify what the problem is? Are there faster APIs other than CreateFile, ReadFile and WriteFile available?

Upvotes: 6

Views: 10998

Answers (7)

Len Holgate

Reputation: 21616

A while back I wrote a blog posting about async file I/O and how it often tends to actually end up being synchronous unless you do everything just right (http://www.lenholgate.com/blog/2008/02/when-are-asynchronous-file-writes-not-asynchronous.html).

The key points are that even when you're using FILE_FLAG_OVERLAPPED and FILE_FLAG_NO_BUFFERING you still need to pre-extend the file so that your async writes don't need to extend the file as they go; for security reasons file extension is always synchronous. To pre-extend you need to do the following:

  • Enable the SE_MANAGE_VOLUME_NAME privilege.
  • Open the file.
  • Seek to the desired file length with SetFilePointerEx().
  • Set the end of file with SetEndOfFile().
  • Set the end of the valid data within the file with SetFileValidData().
  • Close the file.

Then...

  • Open the file to write.
  • Issue the writes.
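
The pre-extension steps above might look roughly like this with the Win32 API. Treat it as a sketch rather than tested production code: preextend_file and enable_manage_volume_privilege are names invented here, error handling is reduced to a bool, and the whole block is Windows-only (hence the _WIN32 guard).

```cpp
#ifdef _WIN32
#include <windows.h>

// Enables SE_MANAGE_VOLUME_NAME on the current process token.
static bool enable_manage_volume_privilege() {
    HANDLE token = nullptr;
    if (!OpenProcessToken(GetCurrentProcess(),
                          TOKEN_ADJUST_PRIVILEGES | TOKEN_QUERY, &token))
        return false;

    TOKEN_PRIVILEGES tp = {};
    tp.PrivilegeCount = 1;
    tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;

    // AdjustTokenPrivileges can return TRUE without granting the privilege;
    // in that case GetLastError() reports ERROR_NOT_ALL_ASSIGNED.
    bool ok = LookupPrivilegeValue(nullptr, SE_MANAGE_VOLUME_NAME,
                                   &tp.Privileges[0].Luid) &&
              AdjustTokenPrivileges(token, FALSE, &tp, 0, nullptr, nullptr) &&
              GetLastError() == ERROR_SUCCESS;
    CloseHandle(token);
    return ok;
}

// Creates the file at `path` and pre-extends it to `length` bytes.
bool preextend_file(const wchar_t* path, LONGLONG length) {
    if (!enable_manage_volume_privilege()) return false;

    HANDLE file = CreateFileW(path, GENERIC_WRITE, 0, nullptr, CREATE_ALWAYS,
                              FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return false;

    LARGE_INTEGER size;
    size.QuadPart = length;
    bool ok = SetFilePointerEx(file, size, nullptr, FILE_BEGIN) &&  // seek to the final length
              SetEndOfFile(file) &&                                  // set the end of file
              SetFileValidData(file, length);                        // set the valid data length
    CloseHandle(file);
    return ok;
}
#endif  // _WIN32
```

After this returns true, the file would be reopened for the actual writes. One thing to keep in mind is that SetFileValidData skips zero-filling, so whatever was previously on those disk clusters is readable through the file until it is overwritten.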

Upvotes: 3

ring0

Reputation: 811

I did some tests and have some results. The tests were performed on 100 Mbps & 1 Gbps NICs. The source machine is a Win2K3 server (SATA) and the target machine is a Win2K3 server (RAID 5).

I ran 3 tests:

1) Network Reader -> This program just reads files across the network. The purpose of the program is to find the maximum network read speed. I am performing NON BUFFERED reads using CreateFile & ReadFile.

2) Disk Writer -> This program benchmarks the RAID 5 speed by writing data. NON BUFFERED writes are performed using CreateFile & WriteFile.

3) Blitz Copy -> This program is the file copy engine. It copies files across the network. The logic of this program was discussed in the initial question. I am using synchronous I/O with NO_BUFFERING Reads & Writes. The APIs used are CreateFile, ReadFile & WriteFile.


Below are the results:

NETWORK READER:-

100 Mbps NIC

Took 148344 ms to read 768 MB with chunk size 8 KB.

Took 89359 ms to read 768 MB with chunk size 64 KB

Took 82625 ms to read 768 MB with chunk size 128 KB

Took 79594 ms to read 768 MB with chunk size 256 KB

Took 78687 ms to read 768 MB with chunk size 512 KB

Took 79078 ms to read 768 MB with chunk size 1024 KB

Took 78594 ms to read 768 MB with chunk size 2048 KB

Took 78406 ms to read 768 MB with chunk size 4096 KB

Took 78281 ms to read 768 MB with chunk size 8192 KB

1 Gbps NIC

Took 206203 ms to read 5120 MB (5GB) with chunk size 8 KB

Took 77860 ms to read 5120 MB with chunk size 64 KB

Took 74531 ms to read 5120 MB with chunk size 128 KB

Took 68656 ms to read 5120 MB with chunk size 256 KB

Took 64922 ms to read 5120 MB with chunk size 512 KB

Took 66312 ms to read 5120 MB with chunk size 1024 KB

Took 68688 ms to read 5120 MB with chunk size 2048 KB

Took 64922 ms to read 5120 MB with chunk size 4096 KB

Took 66047 ms to read 5120 MB with chunk size 8192 KB

DISK WRITER:-

Write performed on RAID 5 With NO_BUFFERING & WRITE_THROUGH

Writing 2048MB (2GB) of data with chunk size 4MB took 68328ms.

Writing 2048MB of data with chunk size 8MB took 55985ms.

Writing 2048MB of data with chunk size 16MB took 49569ms.

Writing 2048MB of data with chunk size 32MB took 47281ms.

Write performed on RAID 5 With NO_BUFFERING only

Writing 2048MB (2GB) of data with chunk size 4MB took 57484ms.

Writing 2048MB of data with chunk size 8MB took 52594ms.

Writing 2048MB of data with chunk size 16MB took 49125ms.

Writing 2048MB of data with chunk size 32MB took 46360ms.

Write performance degrades as the chunk size is reduced, and the WRITE_THROUGH flag introduces a further performance hit, most visibly at the smaller chunk sizes.
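
To make the Disk Writer timings comparable with the MB per second figure that ATTO reports, they can be converted with a one-line helper. This is only arithmetic on the numbers listed above; throughput_mb_per_sec is a name made up for the sketch.

```cpp
#include <cassert>

// Throughput in MB/s for `megabytes` transferred in `milliseconds`.
inline double throughput_mb_per_sec(double megabytes, double milliseconds) {
    assert(milliseconds > 0.0);
    return megabytes * 1000.0 / milliseconds;
}
```

For example, the 8 MB runs come out at roughly 36.6 MB/s with WRITE_THROUGH (2048 MB in 55985 ms) and 38.9 MB/s without it (52594 ms), while the 32 MB run without WRITE_THROUGH (46360 ms) reaches roughly 44.2 MB/s.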

BLITZ COPY:-

1 Gbps NIC, Copying 60 GB of files with NO_BUFFERING

Time taken to complete copy: 2236735 ms, i.e. 37.2 mins. The speed is ~97 GB per hour.

100 Mbps NIC, Copying 60 GB of files with NO_BUFFERING

Time taken to complete copy: 7337219 ms, i.e. 122 mins. The speed is ~30 GB per hour.

I did try using the 10-FileCopy program by Jeffrey Richter, which uses async I/O with NO_BUFFERING, but the results were poor. I guess the reason could be that the chunk size is 256 KB... a 256 KB write on RAID 5 is terribly slow.

Comparing with robocopy:

100 Mbps NIC : Blitz Copy and robocopy perform @ ~30 GB per hour.

1 Gbps NIC : Blitz Copy goes @ ~97 GB per hour while robocopy goes @ ~50 GB per hour.

Upvotes: 0

John Knoeller

Reputation: 34128

You should use async IO to get the best performance. That is opening the file with FILE_FLAG_OVERLAPPED and using the LPOVERLAPPED argument of WriteFile. You may or may not get better performance with FILE_FLAG_NO_BUFFERING. You will have to test to see.

FILE_FLAG_NO_BUFFERING will generally give you more consistent speeds and better streaming behavior, and it avoids polluting your disk cache with data that you may not need again, but it isn't necessarily faster overall.

You should also test to see what the best size is for each block of IO. In my experience there is a huge performance difference between copying a file 4 KB at a time and copying it 1 MB at a time.

In my past testing of this (a few years ago) I found that block sizes below about 64kB were dominated by overhead, and total throughput continued to improve with larger block sizes up to about 512KB. I wouldn't be surprised if with today's drives you needed to use block sizes larger than 1MB to get maximum throughput.

The numbers you are currently using appear to be reasonable, but may not be optimal. Also I'm fairly certain that FILE_FLAG_WRITE_THROUGH prevents the use of the on-disk cache and thus will cost you a fair bit of performance.

You need to also be aware that copying files using CreateFile/WriteFile will not copy metadata such as timestamps or alternate data streams on NTFS. You will have to deal with these things on your own.

Actually replacing CopyFile with your own code is quite a lot of work.

Addendum:

I should probably mention that when I tried this with software RAID 0 on Windows NT 3.0 (about 10 years ago), the speed was VERY sensitive to the alignment in memory of the buffers. It turned out that at the time, the SCSI drivers had to use a special algorithm for doing DMA from a scatter/gather list when the DMA covered more than 16 physical regions of memory (64 KB). Getting guaranteed optimal performance required physically contiguous allocations, which is something that only drivers can request. This was basically a workaround for a bug in the DMA controller of a popular chipset back then, and is unlikely to still be an issue.

BUT - I would still strongly suggest that you test ALL power-of-2 block sizes from 32 KB to 32 MB to see which is fastest. And you might consider testing to see if some buffers are consistently faster than others - it's not unheard of.
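
A sweep like that is easy to script. Here is a rough sketch of a harness, assuming you already have some routine that performs one copy with a given block size; power_of_two_sizes, sweep_block_sizes and the copy_with_block_size callback are names invented for the example.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// All power-of-two sizes in [min_bytes, max_bytes].
// min_bytes must itself be a power of two.
std::vector<std::size_t> power_of_two_sizes(std::size_t min_bytes,
                                            std::size_t max_bytes) {
    assert(min_bytes > 0 && (min_bytes & (min_bytes - 1)) == 0);
    std::vector<std::size_t> sizes;
    for (std::size_t s = min_bytes; s <= max_bytes; s *= 2) {
        sizes.push_back(s);
        if (s > max_bytes / 2) break;  // next doubling would exceed max_bytes
    }
    return sizes;
}

// Runs copy_with_block_size once per size and records the elapsed seconds.
std::vector<std::pair<std::size_t, double>> sweep_block_sizes(
    const std::vector<std::size_t>& sizes,
    const std::function<void(std::size_t)>& copy_with_block_size) {
    std::vector<std::pair<std::size_t, double>> results;
    for (std::size_t size : sizes) {
        const auto start = std::chrono::steady_clock::now();
        copy_with_block_size(size);
        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        results.emplace_back(size, elapsed.count());
    }
    return results;
}
```

power_of_two_sizes(32 * 1024, 32 * 1024 * 1024) gives the 11 sizes from 32 KB to 32 MB. Running each size more than once, and not always in the same order, helps keep caching effects from skewing the comparison.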

Upvotes: 7

Foredecker

Reputation: 7493

The right way to do this is with unbuffered, fully asynchronous I/O. You will want to issue multiple I/Os to keep a queue going. This lets the file system, driver, and RAID-5 subsystem manage the I/Os more optimally.

You can also open multiple files and issue reads and writes to multiple files.

NOTE! The optimal number of outstanding I/Os and how you interleave reads and writes will depend greatly on the storage subsystem itself. Your program will need to be highly parameterized so you can tune it.

Note - I believe that Robocopy has been improved - have you tried it?

Upvotes: 0

ring0

Reputation: 811

If write speed is that important, why not consider RAID 0 for your hardware configuration?

  • The customer wants RAID 5.
  • Preferred over RAID 0 because of better fault tolerance.
  • The customer is satisfied with what RAID 5 can offer. The question here is: benchmarking the hardware using ATTO shows a write speed of 2.57 GB per minute (8 MB chunk write), so why can't a copy tool achieve close to it? Something like 2 GB per min is what we are looking for. We've been able to achieve only ~1.5 GB per min so far.

Upvotes: 0

Thomas Matthews

Reputation: 57678

Just remember that a hard disk buffers data coming from the platters and going to the platters. Most disk drives will try to optimize the read requests to keep the platters rotating and minimize head movement. The drives try to absorb as much data from the Host before writing to the platters so that the Host can be disconnected as soon as possible.

Your performance also depends on the I/O bus traffic on the PC as well as the traffic between the disk and the host. There are other factors to consider, such as system tasks and programs running "at the same time". You may not be able to achieve exactly the same performance as your measuring tool. And remember that these timings have an error factor due to the above-mentioned overheads.

If your platform has DMA controllers, try using these.

Upvotes: 0

RickNZ

Reputation: 18654

How fast can you read the source file if you don't write the destination?

Is the source file fragmented? Fragmented reads can be an order of magnitude slower than contiguous reads. You can use the "contig" utility to make it contiguous:

http://technet.microsoft.com/en-us/sysinternals/bb897428.aspx

How fast is the network connecting the two machines?

Have you tried just writing dummy data, without reading it first, like ATTO does?

Do you have more than one read or write request in flight at a time?

What's the stripe size of your RAID-5 array? Writing a full stripe at a time is the fastest way to write to RAID-5.
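
As a rough illustration of the full-stripe point: in RAID-5 each stripe holds one stripe unit per disk, one of which is parity, so the data in a full stripe is the stripe unit size times (number of disks - 1). The helpers below use made-up names, and the stripe unit and disk count in the example are assumptions, not your array's values.

```cpp
#include <cassert>
#include <cstddef>

// Data bytes in one full RAID-5 stripe: one stripe unit per disk,
// minus the unit that holds parity.
inline std::size_t full_stripe_bytes(std::size_t stripe_unit_bytes,
                                     std::size_t disk_count) {
    assert(disk_count >= 3);  // RAID-5 needs at least 3 disks
    return stripe_unit_bytes * (disk_count - 1);
}

// Largest multiple of the full stripe size that fits in write_bytes.
inline std::size_t round_down_to_full_stripes(std::size_t write_bytes,
                                              std::size_t stripe_bytes) {
    assert(stripe_bytes > 0);
    return (write_bytes / stripe_bytes) * stripe_bytes;
}
```

For example, with a 64 KB stripe unit on 5 disks a full stripe is 256 KB and an 8 MB write is exactly 32 full stripes, whereas on 4 disks a full stripe is 192 KB and 8 MB is not a whole number of stripes. The write offsets need to land on stripe boundaries as well for this to help.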

Upvotes: 0
