Reputation: 17240
Is there a generally accepted fastest technique for reading a file into memory in C++?
I will only be reading the file.
I have seen that Boost has an implementation, and I have seen a couple of other implementations on here, but I would like to know what is considered the fastest.
Thank you in advance.
In case it matters, I am considering files up to 1 GB and this is for Windows.
Upvotes: 11
Views: 13185
Reputation: 60987
When you open a file with CreateFile you can hint at what you intend to do. For example, you can combine FILE_FLAG_SEQUENTIAL_SCAN with FILE_FLAG_NO_BUFFERING to bypass lower-level caching that would otherwise take place. It is then your responsibility to load the file block by block: small reads will be very slow, large reads much faster.
My Samsung 970 EVO Plus 1TB NVMe SSD is rated at 3.5 GB/s sequential read. So, with this approach, what do we see?
#include <Windows.h>
#include <cstdio>
#include <cstdlib>

int main() {
    // FILE_FLAG_NO_BUFFERING requires a sector-aligned buffer; a huge malloc
    // like this is page aligned in practice (VirtualAlloc would guarantee it).
    char* buf = (char*)malloc(1024*1024*1024);
    // Not one gigabyte, just happens to be the largest file on my disk at the
    // moment. If you wanna run this code you'll have to tweak some constants ;)
    HANDLE f = CreateFileW(
        L"C:\\Users\\leidegre\\Downloads\\555.85-desktop-win10-win11-64bit-international-dch-whql.exe", // 659 245 848 bytes
        GENERIC_READ,
        FILE_SHARE_READ,
        0,
        OPEN_ALWAYS,
        FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN,
        0);
    printf("f %p\n", f);

    LARGE_INTEGER t0;
    QueryPerformanceCounter(&t0);

    // One big read; with FILE_FLAG_NO_BUFFERING the read size must be a
    // multiple of the sector size (1 GiB is).
    DWORD bytes_read;
    if (!ReadFile(f, buf, 1024*1024*1024, &bytes_read, 0)) {
        printf("!ReadFile: %u\n", GetLastError());
        CloseHandle(f);
        return 1;
    }
    printf("bytes_read %u\n", bytes_read);

    LARGE_INTEGER t1;
    QueryPerformanceCounter(&t1);
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);

    LONGLONG elapsed = t1.QuadPart - t0.QuadPart;
    LONGLONG elapsed_milliseconds = (1000 * elapsed) / freq.QuadPart;
    printf("elapsed_milliseconds %lli\n", elapsed_milliseconds);
    printf("bytes_per_second %lli\n", 1000*(bytes_read/elapsed_milliseconds));
    return 0;
}
The results on my computer are as follows.
With FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN
elapsed_milliseconds 280
bytes_per_second 2354449000
Without FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN
elapsed_milliseconds 395
bytes_per_second 1668976000
Memory mapped file
elapsed_milliseconds 240
bytes_per_second 2746857000
Code for the memory-mapped file version (replace the ReadFile call with this):
bytes_read = 659245848;
// Map the first bytes_read bytes of the file read-only.
HANDLE m = CreateFileMappingW(f, NULL, PAGE_READONLY, 0, bytes_read, NULL);
if (!m) {
    printf("!CreateFileMappingW: %u\n", GetLastError());
    CloseHandle(f);
    return 1;
}
void* p = MapViewOfFile(m, FILE_MAP_READ, 0, 0, bytes_read);
if (!p) {
    printf("!MapViewOfFile: %u\n", GetLastError());
    CloseHandle(m);
    CloseHandle(f);
    return 1;
}
// Touching the pages is what actually reads the file; the memcpy faults
// every page of the view in.
memcpy(buf, p, bytes_read);
We're leaving about 1 GB/s on the table here.
To get further we need to issue multiple overlapped IO requests. This can actually be done quite easily (for sequential reads). Here's my last attempt at hitting 3.5 GB/s.
It works like a ring buffer: we prefetch up to Num_Req chunks of some size, and we can process each chunk as it completes, after the call to GetOverlappedResult (this is where blocking might happen).
#include <Windows.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// These are tuning parameters
constexpr int Num_Req = 4;
constexpr DWORD Chunk_Size = 1024*1024;

HANDLE ev[Num_Req];
OVERLAPPED olp[Num_Req];

// Issue async read number i at file offset off into buf. Requests reuse the
// Num_Req OVERLAPPED/event slots round-robin, like a ring buffer.
void make_req(HANDLE f, int i, DWORD off, char* buf)
{
    int n = i % Num_Req;
    OVERLAPPED* o = &olp[n];
    memset(o, 0, sizeof(OVERLAPPED)); // also zeroes OffsetHigh (file is < 4 GB)
    o->Offset = off;
    o->hEvent = ev[n];
    if (!ReadFile(f, buf, Chunk_Size, 0, o)) {
        if (GetLastError() == ERROR_IO_PENDING) {
            return; // request queued, the normal case
        }
        printf("!ReadFile %u\n", GetLastError());
        abort();
    }
}

int main() {
    char* buf = (char*)malloc(1024*1024*1024);
    for (int i = 0; i < Num_Req; i += 1) {
        ev[i] = CreateEventW(NULL, TRUE, FALSE, NULL);
    }
    int f_size = 659245848;
    HANDLE f = CreateFileW(
        L"C:\\Users\\leidegre\\Downloads\\555.85-desktop-win10-win11-64bit-international-dch-whql.exe", // 659 245 848 bytes
        GENERIC_READ,
        FILE_SHARE_READ,
        0,
        OPEN_ALWAYS,
        FILE_FLAG_OVERLAPPED | FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN,
        0);
    printf("f %p\n", f);

    LARGE_INTEGER t0;
    QueryPerformanceCounter(&t0);

    // Prime the pipeline with Num_Req in-flight requests. bytes_read tracks
    // the next submit offset (so it ends up rounded up to Chunk_Size).
    DWORD bytes_read = 0;
    for (int i = 0; i < Num_Req; i += 1) {
        make_req(f, i, bytes_read, buf + bytes_read);
        bytes_read += Chunk_Size;
    }
    int chunk_count = (f_size + (Chunk_Size-1)) / Chunk_Size;
    for (int i = 0; i < chunk_count; i += 1) {
        int n = i % Num_Req;
        OVERLAPPED* o = &olp[n];
        DWORD bytes_read2;
        // Wait for chunk i to complete; this is where we may block. The chunk
        // at buf + i*Chunk_Size could be processed right here.
        if (!GetOverlappedResult(f, o, &bytes_read2, TRUE)) {
            printf("!GetOverlappedResult %u\n", GetLastError());
            abort();
        }
        // Refill this slot with the request Num_Req chunks ahead.
        int j = i + Num_Req;
        if (j < chunk_count) {
            make_req(f, j, bytes_read, buf + bytes_read);
            bytes_read += Chunk_Size;
        }
    }

    LARGE_INTEGER t1;
    QueryPerformanceCounter(&t1);
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);

    printf("bytes_read %u\n", bytes_read);
    LONGLONG elapsed = t1.QuadPart - t0.QuadPart;
    LONGLONG elapsed_milliseconds = (1000 * elapsed) / freq.QuadPart;
    printf("elapsed_milliseconds %lli\n", elapsed_milliseconds);
    printf("bytes_per_second %lli\n", 1000*(bytes_read/elapsed_milliseconds));
    return 0;
}
elapsed_milliseconds 215
bytes_per_second 3053063000
We get pretty close: 3 GB/s. A couple of things to note: we either have to use a larger number of requests or a larger buffer per request. Either one will hit 3 GB/s, so you can tune this based on your needs.
FILE_FLAG_SEQUENTIAL_SCAN doesn't seem to have an effect, but FILE_FLAG_NO_BUFFERING does. I believe that without it you may be measuring the wrong thing, because if I run my overlapped IO bench without FILE_FLAG_NO_BUFFERING I get speeds of up to 4.8 GB/s (on the second go), which is faster than what my hardware can muster. That makes me think Windows is serving up the data from memory (and not actually reading from disk).
Upvotes: 1
Reputation: 54138
In the event memory-mapped files are not adequate for your application, and file I/O is your bottleneck, using an I/O completion port to handle async I/O on the files will be the fastest you can get on Windows.
I/O completion ports provide an efficient threading model for processing multiple asynchronous I/O requests on a multiprocessor system. When a process creates an I/O completion port, the system creates an associated queue object for requests whose sole purpose is to service these requests. Processes that handle many concurrent asynchronous I/O requests can do so more quickly and efficiently by using I/O completion ports in conjunction with a pre-allocated thread pool than by creating threads at the time they receive an I/O request.
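For illustration, here is a minimal single-threaded sketch of the pattern: associate the file with a completion port, issue an overlapped read, and drain the completion. The file name "input.bin" is a placeholder and error handling is trimmed; a real consumer would keep several reads in flight and service the port from a thread pool.
#include <Windows.h>
#include <cstdio>
#include <cstdlib>

constexpr DWORD Chunk_Size = 1024*1024;

int main() {
    // Open for overlapped (async) IO; "input.bin" is a placeholder path.
    HANDLE f = CreateFileW(L"input.bin", GENERIC_READ, FILE_SHARE_READ, 0,
                           OPEN_EXISTING, FILE_FLAG_OVERLAPPED, 0);
    if (f == INVALID_HANDLE_VALUE) return 1;

    // Create a completion port and associate the file handle with it.
    HANDLE port = CreateIoCompletionPort(f, NULL, /*CompletionKey*/ 0, 0);
    if (!port) return 1;

    char* buf = (char*)malloc(Chunk_Size);

    // Issue one async read; its completion is reported through the port.
    OVERLAPPED o = {};
    o.Offset = 0; // file offset of this request
    if (!ReadFile(f, buf, Chunk_Size, NULL, &o) &&
        GetLastError() != ERROR_IO_PENDING) return 1;

    // A worker (here: the same thread) waits for completion packets.
    DWORD bytes;
    ULONG_PTR key;
    OVERLAPPED* po;
    if (GetQueuedCompletionStatus(port, &bytes, &key, &po, INFINITE)) {
        printf("read %u bytes at offset %u\n", bytes, po->Offset);
    }

    CloseHandle(port);
    CloseHandle(f);
    return 0;
}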
Upvotes: 1
Reputation: 361302
Consider using memory-mapped files for your case, as the files can be up to 1 GB in size.
You can start with the Win32 file-mapping API; there are several other helpful APIs on the MSDN page.
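A minimal sketch of the Win32 sequence (CreateFileW, CreateFileMappingW, MapViewOfFile); "input.bin" is a placeholder path:
#include <Windows.h>
#include <cstdio>

int main() {
    HANDLE f = CreateFileW(L"input.bin", GENERIC_READ, FILE_SHARE_READ, 0,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
    if (f == INVALID_HANDLE_VALUE) return 1;

    // Map the whole file read-only (a size of 0/0 means "the entire file").
    HANDLE m = CreateFileMappingW(f, NULL, PAGE_READONLY, 0, 0, NULL);
    if (!m) { CloseHandle(f); return 1; }

    const char* p = (const char*)MapViewOfFile(m, FILE_MAP_READ, 0, 0, 0);
    if (!p) { CloseHandle(m); CloseHandle(f); return 1; }

    LARGE_INTEGER size;
    GetFileSizeEx(f, &size);
    // The contents are readable through p[0 .. size-1]; pages are
    // faulted in lazily on first access.
    printf("size %lld, first byte %d\n", size.QuadPart, p[0]);

    UnmapViewOfFile(p);
    CloseHandle(m);
    CloseHandle(f);
    return 0;
}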
Upvotes: 4
Reputation:
Generally speaking, mmap it is. But on Windows they have invented their own way of doing this, see "File Mapping". Boost has a Memory-Mapped Files library that wraps both ways under a portable pile of code.
Also, you have to optimize for your use case if you want to be fast. Just mapping file contents into memory is not enough. It is indeed possible that you don't need memory-mapped files and could be better off using async file I/O, for example. There are many solutions for many problems.
Upvotes: 0
Reputation: 126777
Use memory-mapped files, maybe using the boost wrapper for portability.
If you want to read files bigger than the free, contiguous portion of your virtual address space, you can move the mapped portion of the file at will.
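For instance, a minimal sketch using Boost.Iostreams' mapped_file_source; "input.bin" is a placeholder path:
#include <boost/iostreams/device/mapped_file.hpp>
#include <cstdio>

int main() {
    // Read-only memory mapping, portable across Windows and POSIX.
    boost::iostreams::mapped_file_source file("input.bin");
    if (!file.is_open()) return 1;

    // data() exposes the mapped bytes; size() is the file length.
    const char* p = file.data();
    std::size_t n = file.size();
    printf("size %zu, first byte %d\n", n, n ? p[0] : 0);

    // The mapping is released when `file` goes out of scope.
    return 0;
}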
Upvotes: 5