Veridian

Reputation: 3657

Fastest Way to Read a File Into Memory in C++?

I'm trying to read from a file in a faster way. The current way I'm doing it is shown below, but it is very slow for large files. Is there a faster way to do this? I need the values stored in a struct, which I have defined below.

std::vector<matEntry> matEntries;
inputfileA.open(matrixAfilename.c_str());

// Read from file to continue setting up sparse matrix A
while (!inputfileA.eof()) {
    // Read row, column, and value into vector
    inputfileA >> row; // row
    inputfileA >> col; // col
    inputfileA >> val;       // value

    // Add row, column, and value entry to the matrix
    matEntries.push_back(matEntry());
    matEntries[index].row = row-1;
    matEntries[index].col = col-1;
    matEntries[index].val = val;

    // Increment index
    index++;
}

my struct:

struct matEntry {
    int row;
    int col;
    float val;
};

The file is formatted like this (int, int, float):

1 2 7.9
4 5 9.008
6 3 7.89
10 4 10.21


Upvotes: 2

Views: 1527

Answers (3)

metal

Reputation: 6332

As suggested in the comments, you should profile your code before trying to optimize. If you want to try random stuff until the performance is good enough, you can try reading it into memory first. Here's a simple example with some basic profiling built in:

#include <vector>
#include <ctime>
#include <fstream>
#include <sstream>
#include <iostream>

// Assuming something like this...
struct matEntry
{
    int row, col;
    double val;
};

std::istream& operator >> ( std::istream& is, matEntry& e )
{
    is >> e.row >> e.col >> e.val;
    e.row -= 1;
    e.col -= 1;
    return is;
}


std::vector<matEntry> ReadMatrices( std::istream& stream )
{
    auto matEntries = std::vector<matEntry>();

    auto e = matEntry();
    // For why this is better than your EOF test, see https://isocpp.org/wiki/faq/input-output#istream-and-while
    while( stream >> e ) {
        matEntries.push_back( e );
    }
    return matEntries;
}

int main()
{
    const auto time0 = std::clock();

    // Read file a piece at a time
    std::ifstream inputFileA( "matFileA.txt" );
    const auto matA = ReadMatrices( inputFileA );

    const auto time1 = std::clock();

    // Read file into memory (from http://stackoverflow.com/a/2602258/201787)
    std::ifstream inputFileB( "matFileB.txt" );
    std::stringstream buffer;
    buffer << inputFileB.rdbuf();
    const auto matB = ReadMatrices( buffer );

    const auto time2 = std::clock();
    std::cout << "A: " << ((time1 - time0) * CLOCKS_PER_SEC) << "  B: " << ((time2 - time1) * CLOCKS_PER_SEC) << "\n";
    std::cout << matA.size() << " " << matB.size();
}

Beware reading the same file on disk twice in a row since the disk caching may hide performance differences.

Other options include:

  • Preallocate space in your vector (perhaps by adding a count to the file format, or by estimating it from the file size)
  • Change your file format to binary, or perhaps compressed data, to minimize read time
  • Memory map the file
  • Parallelize (easy: process file A and file B in separate threads [see std::async() and the sketch after this list]; medium: pipeline it so the read and the conversion are done on different threads; hard: process the same file in separate threads)
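
For the easy parallelization case, here is a minimal sketch (my own illustration, not code from the answer itself). It reuses the ReadMatrices() helper and matEntry struct from the example above, assumes the two input files are independent, and the file names are placeholders:

#include <fstream>
#include <future>
#include <iostream>
#include <vector>

// Sketch: parse file A and file B on separate threads with std::async.
int main()
{
    auto futureA = std::async( std::launch::async, [] {
        std::ifstream in( "matFileA.txt" );
        return ReadMatrices( in );
    } );
    auto futureB = std::async( std::launch::async, [] {
        std::ifstream in( "matFileB.txt" );
        return ReadMatrices( in );
    } );

    const auto matA = futureA.get();  // blocks until thread A is done
    const auto matB = futureB.get();
    std::cout << matA.size() << " " << matB.size() << "\n";
}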

Other higher-level considerations might include:

  • It looks like you have a 4-D array of data (rows/cols of 2D matrices). In many applications, this is a mistake. Take a moment to reconsider if this data structure is really what you need.
  • There are many high-quality matrix libraries available (e.g., Boost.QVM, Blaze, etc.). Use them rather than reinventing the wheel.

Upvotes: 2

Leon

Reputation: 32544

In my experience, the slowest part of such code is parsing the numeric values (especially the floating-point ones). Your code is therefore most probably CPU-bound and can be sped up through parallelization as follows:

Assuming that your data is on N lines and you are going to process it using k threads, each thread will have to handle about ⌈N/k⌉ lines.

  1. mmap() the file.
  2. Scan the entire file for newline symbols and identify the range that you are going to assign to every thread.
  3. Let each thread process its range in parallel, using an implementation of std::istream that wraps an in-memory buffer.

Note that this will require ensuring that the code for populating your data structure is thread safe.
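
A rough sketch of these steps (my own filling-in of the outline, not code from this answer): it uses the POSIX open()/mmap() calls, splits the mapping at newline boundaries, and gives each thread its own output vector so no locking is needed. Error handling is omitted for brevity.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <functional>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

struct matEntry { int row, col; float val; };

// Parse one chunk of the mapped text through an in-memory stream.
// (std::istringstream copies the chunk; a custom streambuf could avoid that.)
void parseChunk( const char* begin, const char* end, std::vector<matEntry>& out )
{
    std::string chunk( begin, end );
    std::istringstream is( chunk );
    matEntry e;
    while ( is >> e.row >> e.col >> e.val ) {
        e.row -= 1;
        e.col -= 1;
        out.push_back( e );
    }
}

std::vector<std::vector<matEntry>> parseFileParallel( const char* path, unsigned k )
{
    // 1. mmap() the file.
    const int fd = open( path, O_RDONLY );
    struct stat st;
    fstat( fd, &st );
    const char* data = static_cast<const char*>(
        mmap( nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0 ) );
    const char* fileEnd = data + st.st_size;

    // 2. Pick k split points, each moved forward past the next newline.
    std::vector<const char*> bounds{ data };
    for ( unsigned i = 1; i < k; ++i ) {
        const char* p = data + ( st.st_size / k ) * i;
        while ( p < fileEnd && *p != '\n' ) ++p;
        bounds.push_back( p < fileEnd ? p + 1 : fileEnd );
    }
    bounds.push_back( fileEnd );

    // 3. Each thread parses its own range into its own vector, so the
    //    shared-data-structure problem never arises.
    std::vector<std::vector<matEntry>> results( k );
    std::vector<std::thread> threads;
    for ( unsigned i = 0; i < k; ++i )
        threads.emplace_back( parseChunk, bounds[i], bounds[i + 1], std::ref( results[i] ) );
    for ( auto& t : threads ) t.join();

    munmap( const_cast<char*>( data ), st.st_size );
    close( fd );
    return results;
}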

Upvotes: 2

NaCl

Reputation: 2723

To make things easier, I'd define an input stream operator for your struct.

std::istream& operator>>(std::istream& is, matEntry& e)
{
    is >> e.row >> e.col >> e.val;
    e.row -= 1;
    e.col -= 1;

    return is;
}

Regarding speed, there is not much to improve without dropping to a much lower level of file I/O. I think the only thing you can do is size your vector up front so that it doesn't keep reallocating inside the loop. With the input stream operator defined, it looks much cleaner as well:

std::vector<matEntry> matEntries;
matEntries.resize(numberOfLines);   // numberOfLines assumed known in advance
inputfileA.open(matrixAfilename.c_str());

// Read from file to continue setting up sparse matrix A
std::size_t index = 0;
while (index < numberOfLines && (inputfileA >> matEntries[index++]))
{  }
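
If numberOfLines is not known in advance, one option (my own addition, not part of this answer) is to estimate the entry count from the file size and use reserve() instead of resize(), so push_back() stays cheap even when the estimate is off:

#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: guess the number of entries from the file size,
// assuming an average line length (about 12 bytes for lines like "10 4 10.21\n").
std::size_t estimateEntryCount( const std::string& filename, std::size_t avgBytesPerLine = 12 )
{
    std::ifstream f( filename, std::ios::binary | std::ios::ate );
    const std::streamoff bytes = f.tellg();
    return bytes > 0 ? static_cast<std::size_t>( bytes ) / avgBytesPerLine + 1 : 0;
}

// Usage:
//   std::vector<matEntry> matEntries;
//   matEntries.reserve( estimateEntryCount( matrixAfilename ) );
//   matEntry e;
//   while ( inputfileA >> e )
//       matEntries.push_back( e );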

Upvotes: 3
