Reputation: 11
I have a program where I want to store kmers (substrings of size k) and the number of times they appear. For this particular application, I'm reading in a file with these values and if the number of times they appear is > 255, it is ok to round down to 255. I thought that if I store the key-value pairs as (string, unsigned char) that might save space compared to storing the key-value pairs as (string, int), but this did not seem to be the case when I checked the max resident size by running /usr/bin/time.
To confirm, I also tried running the following test program where I alternated the type of the value in the unordered_map:
#include <iostream>
#include <unordered_map>
#include <utility>
#include <string>
#include <fstream>
int main() {
std::unordered_map<std::string, unsigned char> kmap;
std::ifstream infile("kmers_from_reads");
std::string kmer;
int abun;
while(infile >> kmer >> abun) {
unsigned char abundance = (abun > 255) ? 255 : abun;
kmap[kmer] = abundance;
}
std::cout << sizeof(*kmap.begin(0)) << std::endl;
}
This did not seem to impact the size of the nodes in the bucket (on my machine it returned 40 for both unsigned char and int values).
I was wondering how the size of the nodes in each bucket is determined.
My understanding of unordered maps is that the c++ standard more or less requires separate chaining and each node in a bucket must have at least one pointer so that the elements are iterable and can be erased (http://bannalia.blogspot.com/2013/10/implementation-of-c-unordered.html). However, I don't understand how the amount of space to store a value is determined, and it seems like it must also be flexible to accommodate larger values. I also tried looking at the gcc libstc++ unordered_map header (https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/bits/unordered_map.h) but had a hard time understanding what was going on.
Upvotes: 0
Views: 458
Reputation: 182827
Compile and execute this code:
#include <iostream>
#include <unordered_map>
#include <utility>
#include <string>
#include <fstream>
class foo
{
std::string kmer;
unsigned char abun;
};
class bar
{
std::string kmer;
int abun;
};
int main() {
std::cout << sizeof(foo) << " " << sizeof(bar) << std::endl;
}
I get, and you probably will too, 40 40
. This is because of alignment requirements. If, for example, std::string
contains at least one pointer (which it almost certainly does), it has to be aligned on at least a 4-byte boundary.
Imagine if sizeof(foo)
was 39 and you had code that did foo foos[2]
. If the pointer in foos[0].kmer
was properly aligned, the pointer in foos[1].kmer
wouldn't be. That would be a disaster.
Upvotes: 1