Reputation: 133
I have a piece of code that needs to modify a few bytes towards the end of a file. The problem is that the files are huge, up to 100+ GB.
I need the operation to be as fast as possible, but after hours of Googling it looks like .NET is rather limited here?
I have mostly been using System.IO.FileStream and know of no other methods. A "reverse" FileStream would do, but I have no idea how to create one (one that writes from the end instead of the beginning).
Here is roughly what I do (note: the time is spent when closing the stream):
static void Main(string[] args)
{
    //Simulate a large file
    int size = 1000 * 1024 * 1024;
    string filename = "blah.dat";
    FileStream fs = new FileStream(filename, FileMode.Create);
    fs.SetLength(size);
    fs.Close();

    //Modify the last byte
    fs = new FileStream(filename, FileMode.Open);
    //If I don't seek, the modification happens instantly
    fs.Seek(-1, SeekOrigin.End);
    fs.WriteByte(255);

    //Now, since I am modifying the last byte,
    //this last step is very slow
    fs.Close();
}
Upvotes: 13
Views: 12720
Reputation: 131712
Possibly the fastest way to work with large files is using a MemoryMappedFile. A memory-mapped file is a file that is mapped (not loaded) into virtual memory, so you can access random bytes in it without having to seek to a specific location, load buffers, etc. You can also read entire structures directly from the file without going through deserialization.
The following code, straight out of MSDN, reads, brightens, and writes back MyColor structures over a 512 MB region in the middle of an extremely large file:
static void Main(string[] args)
{
    long offset = 0x10000000; // 256 megabytes
    long length = 0x20000000; // 512 megabytes

    // Create a memory-mapped view of a portion of
    // an extremely large image, from the 256th megabyte (the offset)
    // to the 768th megabyte (the offset plus length).
    using (var mmf = MemoryMappedFile.CreateFromFile(
        @"c:\ExtremelyLargeImage.data", FileMode.Open, "ImgA"))
    {
        using (var accessor = mmf.CreateViewAccessor(offset, length))
        {
            int colorSize = Marshal.SizeOf(typeof(MyColor));
            MyColor color;

            // Make changes to the view.
            for (long i = 0; i < length; i += colorSize)
            {
                accessor.Read(i, out color);
                color.Brighten(10);
                accessor.Write(i, ref color);
            }
        }
    }
}
public struct MyColor
{
    public short Red;
    public short Green;
    public short Blue;
    public short Alpha;

    // Make the view brighter.
    public void Brighten(short value)
    {
        Red = (short)Math.Min(short.MaxValue, (int)Red + value);
        Green = (short)Math.Min(short.MaxValue, (int)Green + value);
        Blue = (short)Math.Min(short.MaxValue, (int)Blue + value);
        Alpha = (short)Math.Min(short.MaxValue, (int)Alpha + value);
    }
}
You can find more info and samples at Memory-Mapped Files
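For the scenario in the question (changing just a few bytes near the end of a huge file), you can map only the tail of the file instead of a large region. Here is a minimal sketch under that assumption; the file name and the byte value written are placeholders, and the file is assumed to already exist:
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class Program
{
    static void Main()
    {
        string filename = "blah.dat"; // placeholder path
        long fileLength = new FileInfo(filename).Length;

        // Map only the last byte of the file rather than the whole thing.
        using (var mmf = MemoryMappedFile.CreateFromFile(
            filename, FileMode.Open, null, 0, MemoryMappedFileAccess.ReadWrite))
        using (var accessor = mmf.CreateViewAccessor(
            fileLength - 1, 1, MemoryMappedFileAccess.ReadWrite))
        {
            accessor.Write(0, (byte)255); // modify the last byte
        }
    }
}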
Upvotes: 4
Reputation: 1039418
I've performed a few tests and the results are a bit confusing. If you create the file and modify it in the same program it is slow:
static void Main(string[] args)
{
    //Simulate a large file
    int size = 100 * 1024 * 1024;
    string filename = "blah.datn";
    using (var fs = new FileStream(filename, FileMode.Create))
    {
        fs.SetLength(size);
    }

    using (var fs = new FileStream(filename, FileMode.Open))
    {
        fs.Seek(-1, SeekOrigin.End);
        fs.WriteByte(255);
    }
}
But if the file exists and you only try to modify the last byte it is fast:
static void Main(string[] args)
{
    string filename = "blah.datn";
    using (var fs = new FileStream(filename, FileMode.Open))
    {
        fs.Seek(-1, SeekOrigin.End);
        fs.WriteByte(255);
    }
}
Hmmm...
UPDATE:
Please ignore my previous observations and unmark this as an answer because it is all wrong.
Investigating the issue further, I've noticed the following pattern. Suppose that you allocate a file of a given size filled with zero bytes like this:
using (var stream = File.OpenWrite("blah.dat"))
{
    stream.SetLength(100 * 1024 * 1024);
}
This operation is very fast and it creates a 100MB file filled with zeros.
Now if in some other program you try to modify the last byte, closing the stream will be slow:
using (var stream = File.OpenWrite("blah.dat"))
{
    stream.Seek(-1, SeekOrigin.End);
    stream.WriteByte(255);
}
I have no idea of the internal workings of the file system or how exactly this file is created, but I have the feeling that it is not completely initialized until you try to modify it, and that is why closing the handle is slow.
To confirm this I tested in unmanaged code (feel free to fix any aberration as my C is very rusty):
#include <stdio.h>

int main(void)
{
    int size = 100 * 1024 * 1024 - 1;
    FILE *handle = fopen("blah.dat", "wb");
    if (handle != NULL) {
        fseek(handle, size, SEEK_SET);
        char buffer[] = {0};
        fwrite(buffer, 1, 1, handle);
        fclose(handle);
    }
    return 0;
}
This behaves the same way as in .NET => it allocates a file of 100MB filled with zeros and it is very fast.
Now when I try to modify the last byte of this file:
#include <stdio.h>

int main(void)
{
    FILE *handle = fopen("blah.dat", "rb+");
    if (handle != NULL) {
        fseek(handle, -1, SEEK_END);
        char buffer[] = {255};
        fwrite(buffer, 1, 1, handle);
        fclose(handle);
    }
    return 0;
}
The last fclose(handle) is slow. I hope some experts will shed some light here.
It seems, though, that modifying the last byte of a real (non-sparse) file using the previous methods is very fast.
Upvotes: 4
Reputation: 10129
I suggest you try it with a real file rather than a "simulated" file. It may be that .NET is using some sparse allocation mechanism and only writes out the file up to the last byte actually written.
So when you write near the beginning of the file it only has to write out a few bytes, but when you write to the end of the file it actually has to write out the whole file.
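One quick way to probe this hypothesis is to check whether the pre-allocated file actually carries the NTFS sparse attribute. A minimal sketch, assuming the test file name from the question:
using System;
using System.IO;

class SparseCheck
{
    static void Main()
    {
        string filename = "blah.dat"; // the pre-allocated test file from the question

        // Files extended with SetLength are not necessarily marked sparse;
        // this simply reports whether the file system flagged the file as such.
        FileAttributes attrs = File.GetAttributes(filename);
        bool isSparse = (attrs & FileAttributes.SparseFile) != 0;
        Console.WriteLine("Sparse: " + isSparse);
    }
}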
Upvotes: 2
Reputation: 273701
Like Darin already noted, this is an artifact of your 'simulation' of a large file.
The delay comes from actually 'filling up' the file, and it only happens the first time. If you repeat the part from //Modify the last byte
through fs.Close();
it will be very fast.
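To see this for yourself, here is a rough timing sketch (my own variation on the question's code, using the same file name and size) that runs the modify-and-close step twice:
using System;
using System.Diagnostics;
using System.IO;

class Program
{
    static void Main()
    {
        string filename = "blah.dat";

        // Pre-allocate the "simulated" large file, as in the question.
        using (var fs = new FileStream(filename, FileMode.Create))
        {
            fs.SetLength(1000L * 1024 * 1024);
        }

        // First pass: the file gets filled up, so closing is slow.
        // Second pass: the same write is nearly instant.
        for (int pass = 1; pass <= 2; pass++)
        {
            var sw = Stopwatch.StartNew();
            using (var fs = new FileStream(filename, FileMode.Open))
            {
                fs.Seek(-1, SeekOrigin.End);
                fs.WriteByte(255);
            }
            sw.Stop();
            Console.WriteLine("Pass {0}: {1} ms", pass, sw.ElapsedMilliseconds);
        }
    }
}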
Upvotes: 11