Reputation: 3750
Consider a situation such as representing a sparse matrix. For example, the matrix could be 1,000,000 rows x 1,000,000 cols (or some other large size), with perhaps 50, 100, or a few thousand cells being non-zero values at any particular time.
I'm attempting to discern the best C++ data structure for representing this. The brute force and very bad answer would be (example only puts a value in 1 cell for brevity, imagine a few hundred or a few thousand cells being filled):
int numRows = 1000000;
int numCols = 1000000;
std::vector<std::vector<int>> sparseMatrix(numRows, std::vector<int>(numCols, 0));
int currentRow = 12345;
int currentCol = 98765;
sparseMatrix[currentRow][currentCol] = 10;
std::cout << "\n" << "sparseMatrix[currentRow][currentCol] = " << sparseMatrix[currentRow][currentCol] << "\n\n";
Clearly this is a disaster due to 99+% of the memory dedicated to the data structure not being used.
The next intuitive option (to me at least) was this:
std::unordered_map<std::pair<int, int>, int> sparseMatrix;
int currentRow = 12345;
int currentCol = 98765;
std::pair<int, int> rowCol = std::make_pair(currentRow, currentCol);
sparseMatrix[rowCol] = 10;
std::cout << "\n" << "sparseMatrix[rowCol] = " << sparseMatrix[rowCol] << "\n\n";
Unfortunately this fails to compile with the error:
attempting to reference a deleted function
After some Googling on this topic it seems unordered_map
is not set up to use a pair as a key.
As far as I can tell, there are 4 remaining legitimate options:
1) Use a map
, which does accept a pair of integers as a key, instead of an unordered_map
, ex (this compiles and runs):
std::map<std::pair<int, int>, int> sparseMatrix;
int currentRow = 12345;
int currentCol = 98765;
std::pair<int, int> rowCol = std::make_pair(currentRow, currentCol);
sparseMatrix[rowCol] = 10;
std::cout << "\n" << "sparseMatrix[rowCol] = " << sparseMatrix[rowCol] << "\n\n";
2) Use an unordered_map
of unordered_map
s, ex (this also compiles and runs):
std::unordered_map<int, std::unordered_map<int, int>> sparseMatrix;
int currentRow = 12345;
int currentCol = 98765;
sparseMatrix[currentRow][currentCol] = 10;
std::cout << "\n" << "sparseMatrix[currentRow][currentCol] = " << sparseMatrix[currentRow][currentCol] << "\n\n";
3) Make my own hash function for the row and col integers and feed that into a more typical std::unordered_map<int, int>
. This seems like a very bad option since if two integer pairs mapped to the same hash key that would be difficult to deal with.
4) Use boost::hash, which I gather would look something like:
std::unordered_map<std::pair<int, int>, int, boost::hash<pair<int, int>>> sparseMatrix;
I'd tend to not prefer this option b/c 1) the data structure looks very awkward, 2) I'm not sure how to do the rest of the implementation, and 3) in some cases boost may not be available.
So to clarify my questions, here they are:
1) Which option above is best for most situations? (I'd really prefer to stick to #1 or #2 if reasonably possible).
2) From what I know about map
s (red-black trees) vs unordered_map
s (hash tables) I'm under the impression that #1 would be the best on memory but #2 would be faster, is my understanding correct in this case?
3) If I'm correct on #1 being better on memory and #2 being faster, is there a clear winner in the general case I mentioned above (1,000,000 x 1,000,000 matrix with typically about 1,000 values populated) or is the difference roughly a wash?
4) How difficult would the implementation of #3 and #4 be? If #3 and/or #4 were implemented very well, would the performance benefit be enough to outweigh the coding complexity cost vs #1 or #2?
Before somebody marks this post as a duplicate, I've read this post Why can't I compile an unordered_map with a pair as key? which touches on the options above but not provide an answer to the questions I've asked here.
Before somebody says "use the built-in boot sparse matrix", yes, I'm aware that boost and some other libraries have a sparse matrix class already available. I'm still asking this question however, b/c an unordered map where the key is 2 integers my be useful in some other cases, and also some people may not have access to boost or may desire to do their own more specific implementation for a certain purpose.
Upvotes: 2
Views: 3162
Reputation: 17415
It may or may not solve your problem, but one of your assumptions is wrong:
3) Make my own hash function for the row and col integers and feed that into a more typical std::unordered_map. This seems like a very bad option since if two integer pairs mapped to the same hash key that would be difficult to deal with.
Dealing with hash collisions is not something you have to do but something that unordered_map
does for you. Even if all values' hashes mapped to the same integer, it would correctly make sure that different values are treated as different keys, even though the performance would degrade.
That said, a map of maps (map
or unordered_map
) would both work and provide reasonable performance assuming you only have the few elements that you mention.
Upvotes: 0
Reputation: 106068
Clearly this is a disaster due to 99+% of the memory dedicated to the data structure not being used.
That's not clear at all. Modern OSes tend to provide the application with virtual memory that only gets backed with physical RAM when accessed, so only the memory pages you write elements into need backing RAM. If you have at most thousands of entries in your array, and each memory page is say 4k, you'd be using order-of tens of megabytes - hardly a strain on a typical modern machine. So, it's wasteful, but not necessarily problematically wasteful. It's not cache friendly - the performance implications thereof might prove the greater concern.
4) Use boost::hash, which I gather would look something like:
std::unordered_map<std::pair<int, int>, int, boost::hash<pair<int, int>>> sparseMatrix;
I'd tend to not prefer this option b/c 1) the data structure looks very awkward, 2) I'm not sure how to do the rest of the implementation, and 3) in some cases boost may not be available.
1) looks awkward? come on... 2) there is nothing else to do - you just use it like any other unordered_map
3) you can create your own then based on boost's (see this q):
template <class T>
inline void hash_combine(std::size_t& seed, const T& v)
{
std::hash<T> hasher;
seed ^= hasher(v) + 0x9e3779b9 + (seed<<6) + (seed>>2);
}
struct hash_pair
{
std::size_t operator()(const std::pair<int, int>& p) const
{
std::size_t h = 0;
hash_combine(h, p.first);
hash_combine(h, p.second);
return h;
}
};
1) Which option above is best for most situations? (I'd really prefer to stick to #1 or #2 if reasonably possible).
None of your numbered options is the best for most situations: per your stated concerns about depending on boost, create your own based on boost's implementation of hash_combine
is the best general solution based on Standard Library containers.
2) From what I know about maps (red-black trees) vs unordered_maps (hash tables) I'm under the impression that #1 would be the best on memory but #2 would be faster, is my understanding correct in this case?
Memory usage won't be dramatically different. GCC's hash tables uses a linked list to store values, wherein each value requires a dynamic memory allocation with linking pointers, plus a contiguous array for the buckets (each being a list iterator; the arrays will be (re)sized to maintain a reasonable load factor, so won't be particularly big). map
s use a dynamic memory allocation per value too - but allocate a little extra for left/right pointers. Much of a muchness.
3) If I'm correct on #1 being better on memory and #2 being faster, is there a clear winner in the general case I mentioned above (1,000,000 x 1,000,000 matrix with typically about 1,000 values populated) or is the difference roughly a wash?
As mentioned, memory usage shouldn't be expected to be dramatically better for one over the other (though implementations may vary). As for faster, when there are so few values populated, just implement them both and measure. The advantages of hash tables more consistently dominate when the number of populated elements is large.
4) How difficult would the implementation of #3 and #4 be? If #3 and/or #4 were implemented very well, would the performance benefit be enough to outweigh the coding complexity cost vs #1 or #2?
As mentioned, you should be comparing #1 with a rip-off of #4. Forget #3 - it's fundamentally flawed as you realised yourself "very bad option since if two integer pairs mapped to the same hash key that would be difficult to deal with".
As for coding complexity - there is virtually none. Just copy the hash implementation above, specify the hash policy when instantiating the unordered_map
, and get on with using it.
If you encounter actual problems when you're implementing the options, then ask a new question to get help.
Upvotes: 1