Remis
Remis

Reputation: 651

Reading huge 2-dimensional array (matrix) with string indices from the text file efficiently

First of all, I know there are many highly relevant questions, but my very first implementation (based on some suggestions from these Q&Q) is not efficient enough.

I am looking for a way to (significantly) improve my very first implementation of reading a huge (>10000x10000) non-symmetric non-sparse 2-dimensional array (matrix) with string indices from the input text file. Assume also, that we don't know the size of the matrix in advance.

The structure of the external input file (think like a distance matrix between any two locations) looks something like this:

  A   B   C   D   E   F   G
A 0   10  20  30  40  50  60
B 15  0   25  35  45  55  65
C 20  30  0   40  50  60  70
D 25  35  45  0   65  75  85
E 15  20  25  35  0   55  65
F 20  30  40  50  60  0   70
G 35  45  55  65  75  85  0

At the moment, I came up with the following solution:

std::map<std::string, std::map<std::string, int>> 
ReadDistancesFromFile(const char *name) {
  std::string filename(name);

  std::clog << "Trying to open and read: " << filename << std::endl;
  std::ifstream file(name);

  /// If .is_open() returns False, perror prints the error code stored in errno
  if (!file.is_open())
    std::perror(("Error while opening file " + filename).c_str());

  /// Map of maps to save all read distances
  std::map<std::string, std::map<std::string, int>> distances;

  /* 1. Is such an efficient structure (container) for my purpose:
        a) to store data efficiently
        b) to access data using indices quickly?
        c) to update values time after time
        d) insertion/deletion of new elements doesn't happen often */

  /// Vector to store all `String` type indices
  std::vector<std::string> indices;

  /// String to store index (location name)
  std::string index;

  /// Store line from the external file
  std::string line;

  /// Read the first line containing all String indices (location names)
  std::getline(file, line);

  std::istringstream iss(line);

  /// Process the first line: save all location names into `indices` vector
  while (iss >> index) {
    indices.push_back(index);
  }

  /* 2. Probably I could use .reserve() before the while loop?
        The problem that I don't know the size in advance. */

  /// Read the file via std::getline(). Rules obeyed:
  ///   - first the I/O operation, then error check, then data processing
  ///   - failbit and badbit prevent data processing, eofbit does not
  while (std::getline(file, line)) {
    std::istringstream is(line);

    /* 3. Is it efficient to define a stringstream variable inside a loop? */

    /// For each new line (matrix row), read the first String element (location name)
    is >> index;

    int distance;     // To store distance value
    uint column = 0;  // Column number to access location names from `indices` vector

    /// Process the line further: store Int distances from the input stream
    while (is >> distance) {
      distances[index][indices[column++]] = distance;
    }
  }

  /// Only in case of set badbit we are sure that errno has been set
  /// Use perror() to print error details
  if (file.bad())
    std::perror(("Error while reading file " + filename).c_str());

  /// close file
  file.close();

  /// With C++11, std::map has move-semantics, which means the local map will be moved
  /// on return and in some cases even the move can be elided by the compiler (RVO)
  return distances;
}

Quick update on the performance

  1. After reading Is there any advantage of using map over unordered_map in case of trivial keys? I decided to replace std::map with std::unordered_map keeping the rest unchanged. Quite surprisingly to me, this allowed reducing the execution time (reading the entire file) ~4-5 times, i.e., from ~30 sec. to ~5-6 sec. Not bad!
  2. Then, I revised my implementation based on G. Sliepen answer https://stackoverflow.com/a/57562007/3737891, i.e, I replaced std::map<std::string, std::map<std::string, int>> with std::vector<int> and all string indices saved in a separate std::unordered_map<std::string, size_t> type container. Using this approach, execution time went down to ~1-2 seconds - i.e., at least 15 times faster compared to the initial approach!

Upvotes: 2

Views: 708

Answers (1)

G. Sliepen
G. Sliepen

Reputation: 7973

Efficient parsing of a matrix

The most efficient way is to read the values into a one-dimensional std::vector<int>. After the first line, you know the amount of columns in the input file. At the end, you know how many rows you have by dividing the size of the vector by the number of columns. Then, you reinterpret the vector as a two dimensional array.

The first line can be read with std::getline() and parsed using a std::istringstream. However, all the other lines should just be parsed by doing:

int value;
file >> value;
distances.push_back(value);

Of course, you need to ignore the left-most column on every row.

By not reading it line by line, you avoid having to convert the line to a std::istringstream, which is slower than parsing the values directly from file.

std::vector<> will automatically resize itself when necessary, such that adding to the end of the vector is an amortized O(1) operation.

At the end, you will have columns times rows values in the vector, and if you want to access column x of row y, then you have to write something like:

int desired_value = distances[x + y * columns];

Accessing matrix elements by row and column names

If you need to be able to access the data using the names of the rows and columns, you need to store those names along with the index they represent. The most efficient way is to store them into an std::unordered_map<>, like so:

std::unordered_map<std::string, size_t> columns;
std::unordered_map<std::string, size_t> rows;

/// Read the first line containing all String indices (location names)
std::getline(file, line);
std::istringstream iss(line);

/// Process the first line: save all location names into `columns` map
std::string name;
size_t i = 0;

while (iss >> name)
    columns[name] = i++;

/// Process other lines
...

Then, you can get the distance given a row and column name as follows:

size_t x = columns[column];
size_t y = rows[row];
int desired_value = distances[x + y * columns.size()];

Why a map of a map is inefficient

A map is implemented as a balanced tree. Whenever you want to insert something, it has to traverse the tree to find out where to insert the new value. In general, that takes O(log(N)) time. But if you insert new values such that they always come at the end, the tree needs to be rebalanced frequently, which makes it even slower.

Moreover, your map stores a copy of the column name for every value, and a copy of the row name for every row. So with 10000 x 10000 elements, you are storing one hundred million strings, many of them identical, and you are not interested in those strings at all, only in which row or column index they represent.

Upvotes: 3

Related Questions