Reputation: 651
First of all, I know there are many highly relevant questions, but my very first implementation (based on some suggestions from those Q&As) is not efficient enough.
I am looking for a way to (significantly) improve my first implementation of reading a huge (>10000x10000) non-symmetric, non-sparse 2-dimensional array (matrix) with string indices from an input text file. Assume also that we don't know the size of the matrix in advance.
The structure of the external input file (think like a distance matrix between any two locations) looks something like this:
A B C D E F G
A 0 10 20 30 40 50 60
B 15 0 25 35 45 55 65
C 20 30 0 40 50 60 70
D 25 35 45 0 65 75 85
E 15 20 25 35 0 55 65
F 20 30 40 50 60 0 70
G 35 45 55 65 75 85 0
At the moment, I came up with the following solution:
std::map<std::string, std::map<std::string, int>>
ReadDistancesFromFile(const char *name) {
std::string filename(name);
std::clog << "Trying to open and read: " << filename << std::endl;
std::ifstream file(name);
/// If .is_open() returns false, perror prints the error code stored in errno
if (!file.is_open())
std::perror(("Error while opening file " + filename).c_str());
/// Map of maps to save all read distances
std::map<std::string, std::map<std::string, int>> distances;
/* 1. Is such an efficient structure (container) for my purpose:
a) to store data efficiently
b) to access data using indices quickly?
c) to update values time after time
d) insertion/deletion of new elements doesn't happen often */
/// Vector to store all `String` type indices
std::vector<std::string> indices;
/// String to store index (location name)
std::string index;
/// Store line from the external file
std::string line;
/// Read the first line containing all String indices (location names)
std::getline(file, line);
std::istringstream iss(line);
/// Process the first line: save all location names into `indices` vector
while (iss >> index) {
indices.push_back(index);
}
/* 2. Probably I could use .reserve() before the while loop?
The problem that I don't know the size in advance. */
/// Read the file via std::getline(). Rules obeyed:
/// - first the I/O operation, then error check, then data processing
/// - failbit and badbit prevent data processing, eofbit does not
while (std::getline(file, line)) {
std::istringstream is(line);
/* 3. Is it efficient to define a stringstream variable inside a loop? */
/// For each new line (matrix row), read the first String element (location name)
is >> index;
int distance; // To store distance value
unsigned int column = 0; // Column index to access location names from the `indices` vector
/// Process the line further: store Int distances from the input stream
while (is >> distance) {
distances[index][indices[column++]] = distance;
}
}
/// Only if badbit is set can we be sure that errno has been set
/// Use perror() to print error details
if (file.bad())
std::perror(("Error while reading file " + filename).c_str());
/// close file
file.close();
/// With C++11, std::map has move-semantics, which means the local map will be moved
/// on return and in some cases even the move can be elided by the compiler (RVO)
return distances;
}
First, I left three questions in the source code as comments. Your answers are very welcome.
Second, I did some minimal benchmarks using a much smaller input file of ~2000x2000 entries; on my mid-range MacBook Pro (late 2015) it took around 30 seconds. I believe this is too long (performance really matters in my case) and would be grateful for your ideas on how to improve this code.
Update: Following the suggestions, I first replaced std::map with std::unordered_map, keeping the rest unchanged. Quite surprisingly to me, this alone reduced the execution time (reading the entire file) ~4-5 times, i.e., from ~30 sec. to ~5-6 sec. Not bad!
I then replaced std::map<std::string, std::map<std::string, int>> with a one-dimensional std::vector<int>, with all string indices saved in a separate std::unordered_map<std::string, size_t> container. Using this approach, execution time went down to ~1-2 seconds - i.e., at least 15 times faster compared to the initial approach!
Upvotes: 2
Views: 708
Reputation: 7973
The most efficient way is to read the values into a one-dimensional std::vector<int>. After the first line, you know the number of columns in the input file. At the end, you know how many rows you have by dividing the size of the vector by the number of columns.
Then, you reinterpret the vector as a two-dimensional array.
The first line can be read with std::getline() and parsed using a std::istringstream. However, all the other lines should just be parsed by doing:
int value;
file >> value;
distances.push_back(value);
Of course, you need to ignore the left-most column on every row.
By not reading it line by line, you avoid having to convert each line to a std::istringstream, which is slower than parsing the values directly from file.
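A minimal sketch of this reading scheme (the function name and the use of an std::istream parameter are my own illustrative choices, not from the original post; taking a generic stream also makes the function easy to test with a std::istringstream):

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read a whitespace-separated distance matrix into a flat, row-major vector.
// On return, `columns` holds the number of columns found in the header line.
std::vector<int> ReadDistances(std::istream &in, std::size_t &columns) {
    std::vector<int> distances;
    std::string line, name;

    // Parse the header line to learn the number of columns.
    std::getline(in, line);
    std::istringstream header(line);
    columns = 0;
    while (header >> name)
        ++columns;

    // If the matrix is roughly square, reserving avoids most reallocations.
    distances.reserve(columns * columns);

    // Parse the rest of the stream directly: skip the row label,
    // then read `columns` integers.
    while (in >> name) {
        int value;
        for (std::size_t c = 0; c < columns && (in >> value); ++c)
            distances.push_back(value);
    }
    return distances;
}
```

Note that the row label is consumed by `in >> name` before the inner loop, which handles "ignore the left-most column" without any per-line string stream.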
std::vector<> will automatically resize itself when necessary, such that adding to the end of the vector is an amortized O(1) operation.
At the end, you will have columns times rows values in the vector, and if you want to access column x of row y, then you have to write something like:
int desired_value = distances[x + y * columns];
If you need to be able to access the data using the names of the rows and columns, you need to store those names along with the index they represent.
The most efficient way is to store them in an std::unordered_map<>, like so:
std::unordered_map<std::string, size_t> columns;
std::unordered_map<std::string, size_t> rows;
/// Read the first line containing all String indices (location names)
std::getline(file, line);
std::istringstream iss(line);
/// Process the first line: save all location names into `columns` map
std::string name;
size_t i = 0;
while (iss >> name)
columns[name] = i++;
/// Process other lines
...
Then, you can get the distance given a row and column name as follows:
size_t x = columns[column];
size_t y = rows[row];
int desired_value = distances[x + y * columns.size()];
A map is implemented as a balanced tree. Whenever you want to insert something, it has to traverse the tree to find out where to insert the new value. In general, that takes O(log(N)) time. But if you insert new values such that they always come at the end, the tree needs to be rebalanced frequently, which makes it even slower.
Moreover, your map stores a copy of the column name for every value, and a copy of the row name for every row. So with 10000 x 10000 elements, you are storing one hundred million strings, many of them identical, and you are not interested in those strings at all, only in which row or column index they represent.
Upvotes: 3