greatdisplayname
greatdisplayname

Reputation: 11

Node size for unordered_map buckets

I have a program where I want to store kmers (substrings of size k) and the number of times they appear. For this particular application, I'm reading in a file with these values and if the number of times they appear is > 255, it is ok to round down to 255. I thought that if I store the key-value pairs as (string, unsigned char) that might save space compared to storing the key-value pairs as (string, int), but this did not seem to be the case when I checked the max resident size by running /usr/bin/time.

To confirm, I also tried running the following test program where I alternated the type of the value in the unordered_map:

#include <iostream>
#include <unordered_map>
#include <utility>
#include <string>
#include <fstream>

int main() {
    std::unordered_map<std::string, unsigned char> kmap;
    std::ifstream infile("kmers_from_reads");
    std::string kmer;
    int abun;

    while(infile >> kmer >> abun) {
        unsigned char abundance = (abun > 255) ? 255 : abun;
        kmap[kmer] = abundance;
    }

    std::cout << sizeof(*kmap.begin(0)) << std::endl; 
}

This did not seem to impact the size of the nodes in the bucket (on my machine it returned 40 for both unsigned char and int values).

I was wondering how the size of the nodes in each bucket is determined.

My understanding of unordered maps is that the c++ standard more or less requires separate chaining and each node in a bucket must have at least one pointer so that the elements are iterable and can be erased (http://bannalia.blogspot.com/2013/10/implementation-of-c-unordered.html). However, I don't understand how the amount of space to store a value is determined, and it seems like it must also be flexible to accommodate larger values. I also tried looking at the gcc libstc++ unordered_map header (https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/bits/unordered_map.h) but had a hard time understanding what was going on.

Upvotes: 0

Views: 458

Answers (1)

David Schwartz
David Schwartz

Reputation: 182827

Compile and execute this code:

#include <iostream>
#include <unordered_map>
#include <utility>
#include <string>
#include <fstream>

class foo
{
   std::string kmer;
   unsigned char abun;
};

class bar
{
    std::string kmer;
    int abun;
};

int main() {
    std::cout << sizeof(foo) << " " << sizeof(bar) << std::endl;
}

I get, and you probably will too, 40 40. This is because of alignment requirements. If, for example, std::string contains at least one pointer (which it almost certainly does), it has to be aligned on at least a 4-byte boundary.

Imagine if sizeof(foo) was 39 and you had code that did foo foos[2]. If the pointer in foos[0].kmer was properly aligned, the pointer in foos[1].kmer wouldn't be. That would be a disaster.

Upvotes: 1

Related Questions