Matt

Reputation: 1347

Writing multiple files slows down after x seconds

I have code that grabs frames from a camera and saves them to disk. The structure of the code is: multiple threads malloc new memory, copy their frames into it, and enqueue those buffers; a final thread dequeues the frames and writes them (using the ffmpeg API, raw video, no compression) to their files. (Actually I'm using my own memory pool, so malloc is only called when I need more buffers.) I can have up to 8 files/cams open and enqueueing at the same time.

The problem is that for the first 45 sec everything works fine: there's never more than one frame in the queue. But after that the queue gets backed up: processing takes just a few ms longer per frame, and because I cannot save the frames fast enough I have to malloc more memory to store them, so RAM usage climbs.

I have an 8-core, 16 GB RAM, Windows 7 64-bit computer (NTFS, lots of free space on a second disk drive). The disk is supposed to be able to write up to 6 Gbit/s. To save my data in time I need to be able to write at 50 MB/s. I tested disk speed with "PassMark PerformanceTest": with 8 threads writing files simultaneously, exactly the way ffmpeg saves them (synchronized, uncached I/O), it achieved 100 MB/s. So why can't my writes achieve that?

Here's how the ffmpeg writes look from Process monitor logs:

 
Time of Day     Operation   File#   Result  Detail
2:30:32.8759350 PM  WriteFile   8   SUCCESS Offset: 749,535,120, Length: 32,768
2:30:32.8759539 PM  WriteFile   8   SUCCESS Offset: 749,567,888, Length: 32,768
2:30:32.8759749 PM  WriteFile   8   SUCCESS Offset: 749,600,656, Length: 32,768
2:30:32.8759939 PM  WriteFile   8   SUCCESS Offset: 749,633,424, Length: 32,768
2:30:32.8760314 PM  WriteFile   8   SUCCESS Offset: 749,666,192, Length: 32,768
2:30:32.8760557 PM  WriteFile   8   SUCCESS Offset: 749,698,960, Length: 32,768
2:30:32.8760866 PM  WriteFile   8   SUCCESS Offset: 749,731,728, Length: 32,768
2:30:32.8761259 PM  WriteFile   8   SUCCESS Offset: 749,764,496, Length: 32,768
2:30:32.8761452 PM  WriteFile   8   SUCCESS Offset: 749,797,264, Length: 32,768
2:30:32.8761629 PM  WriteFile   8   SUCCESS Offset: 749,830,032, Length: 32,768
2:30:32.8761803 PM  WriteFile   8   SUCCESS Offset: 749,862,800, Length: 32,768
2:30:32.8761977 PM  WriteFile   8   SUCCESS Offset: 749,895,568, Length: 32,768
2:30:32.8762235 PM  WriteFile   8   SUCCESS Offset: 749,928,336, Length: 32,768, Priority: Normal
2:30:32.8762973 PM  WriteFile   8   SUCCESS Offset: 749,961,104, Length: 32,768
2:30:32.8763160 PM  WriteFile   8   SUCCESS Offset: 749,993,872, Length: 32,768
2:30:32.8763352 PM  WriteFile   8   SUCCESS Offset: 750,026,640, Length: 32,768
2:30:32.8763502 PM  WriteFile   8   SUCCESS Offset: 750,059,408, Length: 32,768
2:30:32.8763649 PM  WriteFile   8   SUCCESS Offset: 750,092,176, Length: 32,768
2:30:32.8763790 PM  WriteFile   8   SUCCESS Offset: 750,124,944, Length: 32,768
2:30:32.8763955 PM  WriteFile   8   SUCCESS Offset: 750,157,712, Length: 32,768
2:30:32.8764072 PM  WriteFile   8   SUCCESS Offset: 750,190,480, Length: 4,104
2:30:32.8848241 PM  WriteFile   4   SUCCESS Offset: 750,194,584, Length: 32,768
2:30:32.8848481 PM  WriteFile   4   SUCCESS Offset: 750,227,352, Length: 32,768
2:30:32.8848749 PM  ReadFile    4   END OF FILE Offset: 750,256,128, Length: 32,768, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
2:30:32.8848989 PM  WriteFile   4   SUCCESS Offset: 750,260,120, Length: 32,768
2:30:32.8849157 PM  WriteFile   4   SUCCESS Offset: 750,292,888, Length: 32,768
2:30:32.8849319 PM  WriteFile   4   SUCCESS Offset: 750,325,656, Length: 32,768
2:30:32.8849475 PM  WriteFile   4   SUCCESS Offset: 750,358,424, Length: 32,768
2:30:32.8849637 PM  WriteFile   4   SUCCESS Offset: 750,391,192, Length: 32,768
2:30:32.8849880 PM  WriteFile   4   SUCCESS Offset: 750,423,960, Length: 32,768, Priority: Normal
2:30:32.8850400 PM  WriteFile   4   SUCCESS Offset: 750,456,728, Length: 32,768
2:30:32.8850727 PM  WriteFile   4   SUCCESS Offset: 750,489,496, Length: 32,768, Priority: Normal

This looks very efficient. However, according to DiskMon the actual disk writes are ridiculously fragmented, seeking back and forth, which may account for the slow speed. See the graph of the write speed calculated from this data (~5 MB/s).

 Time  Write duration  Sector      Length  MB/sec
95.6    0.00208855  1490439632  896 0.409131784
95.6    0.00208855  1488197000  128 0.058447398
95.6    0.00009537  1482323640  128 1.279965529
95.6    0.00009537  1482336312  768 7.679793174
95.6    0.00009537  1482343992  384 3.839896587
95.6    0.00009537  1482350648  768 7.679793174
95.6    0.00039101  1489278984  1152    2.809730729
95.6    0.00039101  1489393672  896 2.185346123
95.6    0.0001812   1482349368  256 1.347354443
95.6    0.0001812   1482358328  896 4.715740549
95.6    0.0001812   1482370616  640 3.368386107
95.6    0.0001812   1482378040  256 1.347354443
95.6    0.00208855  1488197128  384 0.175342193
95.6    0.00208855  1488202512  640 0.292236989
95.6    0.00208855  1488210320  1024    0.467579182
95.6    0.00009537  1482351416  256 2.559931058
95.6    0.00009537  1482360120  896 8.959758703
95.6    0.00009537  1482371896  640 6.399827645
95.6    0.00009537  1482380088  256 2.559931058
95.7    0.00039101  1489394568  1152    2.809730729
95.7    0.00039101  1489396744  352 0.858528834
95.7    0.00039101  1489507944  544 1.326817289
95.7    0.0001812   1482378296  768 4.042063328
95.7    0.0001812   1482392120  768 4.042063328
95.7    0.0001812   1482400568  512 2.694708885
95.7    0.00208855  1488224144  768 0.350684386
95.7    0.00208855  1488232208  384 0.175342193

[graph: measured disk write speed over time, ~5 MB/s]

I'm pretty confident it's not my code: I timed everything, and enqueueing, for example, takes a few µs, suggesting that threads don't get stuck waiting for each other. It must be the disk writes. So the question is: how can I improve my disk writes, and what can I do to profile the actual disk writes (remember that I rely on the FFmpeg DLLs to save, so I cannot access the low-level writing functions directly)? If I can't figure it out, I'll dump all the frames into a single sequential binary file (which should increase I/O speed) and then split it into video files in post-processing.

I don't know how much of my disk I/O is being cached (CacheSet only shows the disk C cache size), but the following images from Performance Monitor, taken at 0 and 45 sec into the video (just before my queue starts piling up), look weird to me. Basically, the modified and standby sets grew from very little to these large values. Is that my data being cached? Is it possible that data only starts being written to disk at 45 sec, so suddenly everything slows down? [two Performance Monitor screenshots] (FYI, LabVIEW is the program that loads my DLL.)

I'll appreciate any help.
M.

Upvotes: 9

Views: 2269

Answers (4)

Damon

Reputation: 70166

I think it's ultimately a consequence of using an "unwise" write mode.

Why does it work with 8 threads writing in your synthetic test? Well, because in that case, little does it matter if a thread blocks. You're pushing data towards the drive full throttle. It's not surprising that you get 100 MB/s that way, which is roughly the reported real-life speed of the drive that you named.

What happens in the "real" program? You have 8 producers and a single consumer that pushes data to disk. You use non-buffered, synchronized writes, which means that for as long as it takes the drive to receive the data, your writer thread blocks and does nothing. Producers keep producing. "That's fine," you say, "I only need 50 MB/s or so, and the drive can do 100 MB/s."
Yes, except... that 100 MB/s is the maximum with several threads hammering the drive concurrently. You have zero concurrency here, so you don't get those 100 MB/s. Or maybe you do, for a while.

And then, the drive needs to do something. Something? Something, anything, whatever, say, a seek. Or, you get scheduled out for an instant, or for whatever reason the controller or OS or whatever doesn't push data for a tenth millisecond, and when you look again next, the disk has spun on, so you need to wait one rotation (no joke, that happens!). Maybe there's just one or two sectors in the way on a fragmented disk, who knows. It doesn't matter what... something happens.
This takes, say, anywhere from one to ten milliseconds, and your thread does nothing during that time; it's still waiting for the blocking WriteFile call to finish. Producers keep producing and keep pulling memory blocks from the pool. The pool runs out of blocks and allocates more. Producers keep producing, and producing. Meanwhile, your writer thread still holds the buffer, and still does nothing.

See where this is going? That cannot be sustained forever. Even if it temporarily recovers, it's always fighting uphill. And losing.

Caching (also write caching) doesn't exist for no reason. Things in the real world do not always go as fast as we wish, both in terms of bandwidth and latency.

On the internet, we can easily crank up bandwidth to gigabit uplinks, but we are bound by latency, and there is absolutely nothing we can do about it. That's why transatlantic communications still suck like they did 30 years ago, and they will still suck in 30 years (the speed of light very likely won't change).

Inside a computer, things are very similar. You can easily push two-digit gigabytes per second over a PCIe bus, but doing so in the first place takes a painfully long time. Graphics programmers know that only too well.

The same applies to disks, and it is no surprise that exactly the same strategy is applied to solve the issue: Overlap transfers, do them asynchronously. That's what virtually every operating system does by default. Data is copied to buffers, and written out lazily, and asynchronously.

That, however, is exactly what you disable by using unbuffered writes. Which is almost always (with very, very few singular exceptions) a bad idea.

For some reason, people have kept around this mindset of DMA doing magic things and being totally superior speed-wise. Which, maybe, was even true at some point in the distant past, and maybe still is for some very rare corner cases (e.g. read-once from optical disks). In all other cases, it's a disastrous anti-optimization. Things that could run in parallel run sequentially. Latencies add up. Threads block, and do nothing.

Simply using "normal" writes means data is copied to buffers and WriteFile returns almost instantly (well, with a delay equivalent to a memcpy). The drive never starves (well, hopefully; you can never be 100% sure), your writer thread never blocks, memory blocks don't stack up in the queue, and the pool doesn't run out, so it doesn't need to allocate more.

Upvotes: 0

Chris

Reputation: 1

I am copying a directory with 100,000+ files to an exFAT disk. The copy starts at around 60 MB/s and degrades to about 1-2 MB/s. For me the quick & dirty fix is to put the computer to sleep during the operation and then wake it up. The original speed returns instantly. Not very sophisticated, but it works for me.

Upvotes: 0

vine'th

Reputation: 5030

The problem is the repeated malloc and free, which puts a load on the system. I suggest creating a buffer pool, i.e. allocating N buffers in the initialization stage and reusing them instead of mallocing and freeing memory. Since you mentioned ffmpeg, to give an example from multimedia: in GStreamer, buffer management happens in the form of buffer pools, and in a GStreamer pipeline buffers are usually taken from and passed around via buffer pools. Most multimedia systems do this.

Regarding:

The problem is that for the first 45 sec everything works fine: there's never more than one frame on queue. But after that my queue gets backed up, processing takes just a few ms longer resulting in increased ram usage because I cannot save the frames fast enough so I have to malloc more memory to store them.

The application is thrashing at this point. Calling malloc then will make matters even worse. I suggest implementing a producer-consumer model where one side waits depending on the situation. In your case, set a threshold of N buffers: if there are N buffers in the queue, new frames from the camera are not enqueued until the existing buffers are processed.

Another idea: instead of writing raw frames, why not write encoded data? Assuming you want video, you can at least write an elementary H.264 stream (and ffmpeg comes with a good H.264 encoder!), or better yet, if you have access to an MPEG-4 muxer, an mp4 file. This will reduce the memory requirements and the I/O load dramatically.

Upvotes: 2

Steve-o

Reputation: 12866

With CreateFile it looks like you want one or both of these parameters:

  • FILE_FLAG_NO_BUFFERING
  • FILE_FLAG_WRITE_THROUGH

http://msdn.microsoft.com/en-us/library/cc644950(v=vs.85).aspx

Your delayed performance hit occurs when the OS starts pushing data to the disk.

6 Gb/s is the performance capability of the SATA 3 bus, not of the actual devices connected or the physical platters or flash RAM underneath.

A common problem with AV systems is that a constantly high stream of writes can get periodically interrupted by disk-overhead tasks. There used to be special AV disks you could purchase that don't do this; these days you can buy disks with special high-throughput firmware explicitly intended for security video recording.

http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=210671&NewLang=en

Upvotes: 3
