Reputation: 1618
On Python v2.7 in Windows and Linux, what is the most efficient and quick way to sequentially write 5GB of data to a local disk (fixed or removable)? This data will not soon be read and does not need cached.
It seems the normal ways of writing use the OS disk cache (because the system assumes it may re-read this data soon). This clears useful data of of the cache, making the system slower.
Right now I am using f.write() with 65535 bytes of data at a time.
Upvotes: 0
Views: 1267
Reputation: 20528
This answer is relevant for Windows code, I have no idea about the Linux equivalent though I imagine the advice is similar.
If you want to be write the fastest code possible, then write using the Win32API and make sure you read the relevant section of CreateFile. Specifically make sure you do not make the classic mistake of using the FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH flags to open a file, for more explanation see Raymond Chen's classic blog post.
If you insist of writing at some multiple of sector or cluster size, then don't be beholden to the magic number of 65535 (why this number? It's no real multiple). Instead using GetDiskFreeSpace figure out the appropriate sector size, though even this is no real guarantee (some data may be kept with the NTFS file information).
Upvotes: 1
Reputation: 8910
The real reason your OS uses the disk cache isn't that it assumes the data will be re-read -- it's that it wants to speed up the writes. You want to use the OS's write cache as aggressively as you possibly can.
That being said, the "standard" way to do high-performance, high-volume I/O in any language (and probably the most aggressive way to use the OS's read/write caches) is to use memory-mapped I/O. The mmap module (https://docs.python.org/2/library/mmap.html) will provide that, and depending on how you generate your data in the first place, you might even be able to gain more performance by dumping it to the buffer earlier.
Note that with a dataset as big as yours, it'll only work on a 64-bit machine (Python's mmap on 32-bit is limited to 4GiB buffers).
If you want more specific advice, you'll have to give us more info on how you generate your data.
Upvotes: 3