intrigued_66

Reputation: 17240

Fastest technique to read a file into memory?

Is there a generally accepted fastest technique for reading a file into memory in C++?

I will only be reading the file.

I have seen that Boost has an implementation, and I have seen a couple of other implementations on here, but I would like to know what is considered the fastest.

Thank you in advance

In case it matters, I am considering files up to 1 GB and this is for Windows.

Upvotes: 11

Views: 13185

Answers (5)

John Leidegren

Reputation: 60987

When you open a file with CreateFile you can hint at what you intend to do. For example, you can combine FILE_FLAG_SEQUENTIAL_SCAN with FILE_FLAG_NO_BUFFERING to bypass the lower-level caching that would otherwise take place. It is then your responsibility to load the file block by block: small reads will be very slow, while large reads will be fast.

My Samsung 970 EVO Plus 1 TB NVMe SSD is rated at 3.5 GB/s sequential read. So, with this approach, what do we see?

#include <Windows.h>

#include <cstdio>
#include <cstdlib>

int main() {
  char* buf = (char*)malloc(1024*1024*1024);

  // Not one gigabyte, just happens to be the largest file on my disk at the moment. If you wanna run this code you'll have to tweak some constants ;)
  // Note: FILE_FLAG_NO_BUFFERING requires sector-aligned buffers, offsets and read sizes;
  // a malloc this large is page-aligned in practice and 1 GB is a multiple of the sector size.
  HANDLE f = CreateFileW(
    L"C:\\Users\\leidegre\\Downloads\\555.85-desktop-win10-win11-64bit-international-dch-whql.exe", // 659 245 848 bytes
    GENERIC_READ,
    FILE_SHARE_READ,
    0,
    OPEN_EXISTING, // not OPEN_ALWAYS; we never want to create the file
    FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN,
    0);

  if (f == INVALID_HANDLE_VALUE) {
    printf("!CreateFileW: %u\n", GetLastError());
    return 1;
  }

  printf("f %p\n", f);

  LARGE_INTEGER t0;
  QueryPerformanceCounter(&t0);

  DWORD bytes_read;
  if (!ReadFile(f, buf, 1024*1024*1024, &bytes_read, 0)) {
    printf("!ReadFile: %u\n", GetLastError());
    CloseHandle(f);
    return 1;
  }

  printf("bytes_read %u\n", bytes_read);

  LARGE_INTEGER t1;
  QueryPerformanceCounter(&t1);

  LARGE_INTEGER freq;
  QueryPerformanceFrequency(&freq);

  LONGLONG elapsed = t1.QuadPart - t0.QuadPart;
  LONGLONG elapsed_milliseconds = (1000 * elapsed) / freq.QuadPart;

  printf("elapsed_milliseconds %lli\n", elapsed_milliseconds);
  printf("bytes_per_second %lli\n", 1000*(bytes_read/elapsed_milliseconds));

  CloseHandle(f);
  free(buf);
  return 0;
}

The results on my computer are as follows.

With FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN

elapsed_milliseconds  280
bytes_per_second      2354449000

Without FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN

elapsed_milliseconds 395
bytes_per_second     1668976000

Memory mapped file

elapsed_milliseconds 240
bytes_per_second     2746857000

Code for the memory-mapped file (replaces the ReadFile call above; also needs #include <cstring> for memcpy)

bytes_read = 659245848; // the actual file size; a mapping cannot extend past the end of the file

HANDLE m = CreateFileMappingW(f, NULL, PAGE_READONLY, 0, bytes_read, NULL);
if (!m) {
  printf("!CreateFileMappingW: %u\n", GetLastError());
  CloseHandle(f);
  return 1;
}

void* p = MapViewOfFile(m, FILE_MAP_READ, 0, 0, bytes_read);
if (!p) {
  printf("!MapViewOfFile: %u\n", GetLastError());
  CloseHandle(m);
  CloseHandle(f);
  return 1;
}

// Touching the mapped pages with memcpy is what actually pulls the file off disk.
memcpy(buf, p, bytes_read);

UnmapViewOfFile(p);
CloseHandle(m);

Overlapped IO

We're leaving about 1 GB/s on the table here.

To get further we need to issue multiple overlapped IO requests. This can actually be done quite easily (for sequential reads). Here's my last attempt at hitting 3.5 GB/s.

It's like a ring buffer: we prefetch up to Num_Req chunks of some size, and we can process each chunk as it completes after the call to GetOverlappedResult (this is where blocking might happen).

#include <Windows.h>

#include <cstdio>
#include <cstdlib>
#include <cstring>

// These are tuning parameters
constexpr int   Num_Req    = 4;
constexpr DWORD Chunk_Size = 1024*1024;

HANDLE     ev[Num_Req];
OVERLAPPED olp[Num_Req];

// Issue an overlapped read for chunk i at file offset off into buf.
void make_req(HANDLE f, int i, DWORD off, char* buf)
{
  int n = i % Num_Req;

  OVERLAPPED* o = &olp[n];
  memset(o, 0, sizeof(OVERLAPPED));
  o->Offset = off; // files < 4 GB only; also set OffsetHigh for larger files
  o->hEvent = ev[n];

  if (!ReadFile(f, buf, Chunk_Size, 0, o)) {
    if (GetLastError() == ERROR_IO_PENDING) {
      return; // the read was queued, as expected
    }
    printf("!ReadFile %u\n", GetLastError());
    abort();
  }
}

int main() {
  char* buf = (char*)malloc(1024*1024*1024);

  for (int i = 0; i < Num_Req; i += 1) {
    ev[i] = CreateEventW(NULL, TRUE, FALSE, NULL);
  }

  int f_size = 659245848;

  HANDLE f = CreateFileW(
    L"C:\\Users\\leidegre\\Downloads\\555.85-desktop-win10-win11-64bit-international-dch-whql.exe", // 659 245 848 bytes
    GENERIC_READ,
    FILE_SHARE_READ,
    0,
    OPEN_EXISTING,
    FILE_FLAG_OVERLAPPED | FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN,
    0);

  printf("f %p\n", f);

  LARGE_INTEGER t0;
  QueryPerformanceCounter(&t0);

  // Prime the ring: issue the first Num_Req reads up front.
  DWORD bytes_read = 0;
  for (int i = 0; i < Num_Req; i += 1) {
    make_req(f, i, bytes_read, buf + bytes_read);
    bytes_read += Chunk_Size; // was `=+` in the original, which reissued the same offset
  }

  int chunk_count = (f_size + (Chunk_Size - 1)) / Chunk_Size;
  for (int i = 0; i < chunk_count; i += 1) {
    int n = i % Num_Req;

    // Wait for chunk i to complete; this is where we may block.
    OVERLAPPED* o = &olp[n];
    DWORD bytes_read2;
    if (!GetOverlappedResult(f, o, &bytes_read2, TRUE)) {
      printf("!GetOverlappedResult %u\n", GetLastError());
      abort();
    }

    // Reuse this slot for the read Num_Req chunks ahead.
    int j = i + Num_Req;
    if (j < chunk_count) {
      make_req(f, j, bytes_read, buf + bytes_read);
      bytes_read += Chunk_Size;
    }
  }

  LARGE_INTEGER t1;
  QueryPerformanceCounter(&t1);

  LARGE_INTEGER freq;
  QueryPerformanceFrequency(&freq);

  printf("bytes_read %u\n", bytes_read); // note: counts issued bytes, rounded up to whole chunks

  LONGLONG elapsed = t1.QuadPart - t0.QuadPart;
  LONGLONG elapsed_milliseconds = (1000 * elapsed) / freq.QuadPart;

  printf("elapsed_milliseconds %lli\n", elapsed_milliseconds);
  printf("bytes_per_second %lli\n", 1000*(bytes_read/elapsed_milliseconds));
}

elapsed_milliseconds 215
bytes_per_second     3053063000

We get pretty close: 3 GB/s. A couple of things to note: we either have to use a larger number of requests or a larger buffer per request. Either one will hit 3 GB/s, so you can tune this based on your needs.

FILE_FLAG_SEQUENTIAL_SCAN doesn't seem to have an effect, but FILE_FLAG_NO_BUFFERING does. I believe that without it you may be measuring the wrong thing: if I run my overlapped IO benchmark without FILE_FLAG_NO_BUFFERING I get speeds up to 4.8 GB/s (on the second go), which is faster than my hardware can muster, which makes me think Windows is serving up data from memory rather than actually reading from disk.

Upvotes: 1

Steve Townsend

Reputation: 54138

In the event memory-mapped files are not adequate for your application, and file I/O is your bottleneck, using an I/O completion port to handle async I/O on the files will be the fastest you can get on Windows.

I/O completion ports provide an efficient threading model for processing multiple asynchronous I/O requests on a multiprocessor system. When a process creates an I/O completion port, the system creates an associated queue object for requests whose sole purpose is to service these requests. Processes that handle many concurrent asynchronous I/O requests can do so more quickly and efficiently by using I/O completion ports in conjunction with a pre-allocated thread pool than by creating threads at the time they receive an I/O request.
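For illustration, here is a minimal sketch of the mechanics, single-threaded for brevity; the path, chunk size, and queue depth are placeholders, and a real application would service the port from a pool of worker threads as described above:

#include <Windows.h>
#include <cstdio>
#include <cstdlib>

int main() {
  // Open for asynchronous (overlapped) I/O; the path is a placeholder.
  HANDLE f = CreateFileW(L"C:\\some\\large\\file.bin", GENERIC_READ, FILE_SHARE_READ,
                         NULL, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
  if (f == INVALID_HANDLE_VALUE) return 1;

  // Create a completion port and associate the file handle with it.
  HANDLE port = CreateIoCompletionPort(f, NULL, /*CompletionKey*/ 0, 0);
  if (!port) return 1;

  constexpr DWORD Chunk = 1024 * 1024;
  constexpr int   Depth = 4; // requests kept in flight
  char* buf = (char*)malloc(Depth * Chunk);
  OVERLAPPED olp[Depth] = {};

  // Issue Depth reads; each completion arrives as a packet on the port.
  for (int i = 0; i < Depth; i++) {
    olp[i].Offset = i * Chunk;
    if (!ReadFile(f, buf + i * Chunk, Chunk, NULL, &olp[i]) &&
        GetLastError() != ERROR_IO_PENDING) return 1;
  }

  // Reap completions; in a real application a pool of worker threads would all
  // block here, process packets as they arrive, and reissue further reads.
  for (int i = 0; i < Depth; i++) {
    DWORD n;
    ULONG_PTR key;
    OVERLAPPED* o;
    if (!GetQueuedCompletionStatus(port, &n, &key, &o, INFINITE)) return 1;
    printf("completed %u bytes at offset %u\n", n, o->Offset);
  }

  CloseHandle(port);
  CloseHandle(f);
  free(buf);
  return 0;
}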

Upvotes: 1

Sarfaraz Nawaz

Reputation: 361302

Consider using memory-mapped files for your case, as the files can be up to 1 GB in size.

You can start with the Win32 file-mapping API: CreateFileMapping and MapViewOfFile.

There are several other helpful APIs on the MSDN page.
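A minimal sketch of reading a whole file that way (error handling elided; the path is a placeholder):

#include <Windows.h>
#include <cstdio>

int main() {
  HANDLE f = CreateFileW(L"C:\\some\\file.bin", GENERIC_READ, FILE_SHARE_READ,
                         NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
  HANDLE m = CreateFileMappingW(f, NULL, PAGE_READONLY, 0, 0, NULL); // 0/0 = map whole file
  const char* p = (const char*)MapViewOfFile(m, FILE_MAP_READ, 0, 0, 0); // 0 = view whole mapping

  LARGE_INTEGER size;
  GetFileSizeEx(f, &size);

  // Pages are faulted in lazily as p[0 .. size) is touched.
  printf("size: %lld, first byte: %d\n", size.QuadPart, p[0]);

  UnmapViewOfFile(p);
  CloseHandle(m);
  CloseHandle(f);
  return 0;
}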

Upvotes: 4

user405725

Reputation:

Generally speaking, mmap it is. But on Windows they have invented their own way of doing this, see "File Mapping". Boost has a Memory-Mapped Files library that wraps both ways under a portable pile of code.

Also, you have to optimize for your use case if you want to be fast. Just mapping file contents into memory is not enough. It is indeed possible that you don't need memory-mapped files and could be better off using async file I/O, for example. There are many solutions for many problems.
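The reading side of the Boost wrapper amounts to very little code. A minimal sketch with boost::iostreams::mapped_file_source (the filename is a placeholder; the constructor throws on failure):

#include <boost/iostreams/device/mapped_file.hpp>
#include <cstddef>
#include <cstdio>

int main() {
  // Maps the whole file read-only; the mapping is released on destruction.
  boost::iostreams::mapped_file_source file("some_file.bin");

  const char* data = file.data(); // pointer to the mapped bytes
  std::size_t size = file.size();

  std::printf("size: %zu, first byte: %d\n", size, size ? data[0] : -1);
  return 0;
}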

Upvotes: 0

Matteo Italia

Reputation: 126777

Use memory-mapped files, perhaps via the Boost wrapper for portability.

If you want to read files bigger than the free, contiguous portion of your virtual address space, you can move the mapped portion of the file at will.
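A rough sketch of that windowing with the raw Win32 API (error handling elided; path and window size are placeholders; view offsets must be multiples of the system allocation granularity, typically 64 KB):

#include <Windows.h>
#include <cstdio>

int main() {
  HANDLE f = CreateFileW(L"C:\\some\\huge\\file.bin", GENERIC_READ, FILE_SHARE_READ,
                         NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
  HANDLE m = CreateFileMappingW(f, NULL, PAGE_READONLY, 0, 0, NULL); // 0/0 = whole file

  LARGE_INTEGER size;
  GetFileSizeEx(f, &size);

  // A 256 MB window is trivially a multiple of the 64 KB allocation granularity.
  const ULONGLONG window = 256ull * 1024 * 1024;

  for (ULONGLONG off = 0; off < (ULONGLONG)size.QuadPart; off += window) {
    SIZE_T len = (SIZE_T)min(window, (ULONGLONG)size.QuadPart - off);
    const char* p = (const char*)MapViewOfFile(m, FILE_MAP_READ,
                                               (DWORD)(off >> 32), (DWORD)off, len);
    // ... process p[0 .. len) ...
    UnmapViewOfFile(p);
  }

  CloseHandle(m);
  CloseHandle(f);
  return 0;
}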

Upvotes: 5
