user3734029

Reputation: 85

Count the occurrence of each element in large data stream

I have a simulation with N particles, running over T timesteps. At each timestep, each particle calculates some data about itself and the other particles nearby (within a radius), which is bit-packed into a C string 4-22 bytes long (depending on how many nearby particles there are). I call this a state string.

I need to count how many times each state string occurs, to form a histogram. I've tried using Google's Sparse Hash Map, but the memory overhead is crazy.

I've been running some reduced tests (attached) over 100,000 timesteps for 500 particles. This results in just over 18.2 million unique state strings out of 50 million possible state strings, which is consistent with the actual work that needs to be done.

It ends up using 323 MB for the char* and int of each unique entry, plus the actual state strings themselves. However, Task Manager reports 870 MB used. That is 547 MB of overhead, or ~251.87 bits/entry, far more than the roughly 4-5 bits Google advertises.

So I figure I must be doing something wrong. But then I found this site, which showed similar results; however, I'm not sure whether his charts show just the hash table size or include the size of the actual data as well. Additionally, his code does not free strings that are inserted into the hash map but already exist there (meaning that if his charts do include the size of the actual data, they will be overestimates).

Here is some code showing the problem with the output:

#include <google/sparse_hash_map>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <stdlib.h>

//String equality
struct eqstrc
{
    bool operator()(const char* s1, const char* s2) const
    {
        return (s1 == s2) || (s1 && s2 && !strcmp(s1,s2));
    }   
};

//Hashing function (FNV-1 over the NUL-terminated string)
template <class T>
class fnv1Hash
{
public:
    size_t operator()(const T& c) const {
            unsigned int hash = 2166136261;
            const unsigned char *key = (const unsigned char*)(c);
            size_t L = strlen((const char*)c);
            size_t i = 0;
            for(const unsigned char *s = key; i < L; ++s, ++i)
                hash = (16777619 * hash) ^ (*s);
            return (size_t)hash;
    }
};

//Function to form new string
char * new_string_from_integer(int num)
{
    int ndigits = num == 0 ? 1 : (int)log10((float)num) + 1;
    char * str = (char *)malloc(ndigits + 1);
    sprintf(str, "%d", num);
    return str;
}

typedef google::sparse_hash_map<const char*, int, fnv1Hash<const char*>, eqstrc> HashCharMap;


int main()
{
    HashCharMap hashMapChar;
    int N = 500;
    int T = 100000;
    
    //Fill hash table with strings
    for(int k = 0; k < T; ++k)
    {
        for(int i = 0; i < N; ++i)
        {
            char * newString = new_string_from_integer(i*k);
            std::pair<HashCharMap::iterator, bool> res =  hashMapChar.insert(HashCharMap::value_type(newString, HashCharMap::data_type()));
            (res.first)->second++;

            if(res.second == false) //If the string already in hash map, don't need this memory
                free(newString);
        }
    }

    //Count memory used by the table entries (pointer + count) and the key strings themselves
    size_t dataCount = 0;
    for(HashCharMap::iterator hashCharItr = hashMapChar.begin(); hashCharItr != hashMapChar.end(); ++hashCharItr)
    {
        dataCount += sizeof(char*) + sizeof(unsigned int); //Size of data to store entries
        dataCount += (((strlen(hashCharItr->first) + 1) + 3) & ~0x03); //Size of entries, padded to 4 byte boundaries
    }
    printf("Hash Map Size: %lu\n", (unsigned long)hashMapChar.size());
    printf("Bytes written: %lu\n", (unsigned long)dataCount);

    system("pause");
}

Output

Hash Map Size: 18218975
Bytes written: 339018772
Peak Working Set (Reported by TaskManager): 891,228 K
Overhead: 560,155 K, or 251.87 bits/entry

I've tried both Google Sparse Hash Map v1.10 and v2.0.2.

Am I doing something wrong in my use of the hash map? Or is there a better way to approach this? With these strings, I'd be almost as well off just storing the list of strings, sorting it, and then counting consecutive entries.
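
For illustration, here's a rough sketch of that sort-and-count alternative (not my actual code; it assumes the state strings are first collected, with their explicit lengths, into a std::vector<std::string>):

#include <algorithm>
#include <stdio.h>
#include <string>
#include <vector>

//Sort the strings, then count each run of equal neighbours.
//Assumes all state strings fit in memory as std::string copies.
void count_by_sorting(std::vector<std::string>& strings)
{
    std::sort(strings.begin(), strings.end());
    for (size_t i = 0; i < strings.size(); )
    {
        size_t j = i;
        while (j < strings.size() && strings[j] == strings[i])
            ++j;
        printf("count: %lu\n", (unsigned long)(j - i)); //Histogram bucket for strings[i]
        i = j;
    }
}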

Thanks for any help

Edit

Because I was asked, here is the format of the actual data: each component is 2 bytes, broken into two subparts of 12 bits and 4 bits.

Angles are approximated (divided by 16) so they can be stored in 4 bits.

That's a bit wordy, so I'll write an example:

0x120A 0x001B 0x136F = Particle 288 (0x120) with angle 10 (0xA), which had angle 11 (0xB) in the previous timestep. It interacts with 1 (0x001) other particle. That other particle is particle 310 (0x136), which had angle 15 (0xF) in the previous timestep.
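
To make that concrete, here's a rough sketch of how one 2-byte component could be packed and unpacked (illustrative only; these helper names aren't from my actual code):

//Pack a 12-bit particle index and a 4-bit angle bucket into one 16-bit component.
unsigned short pack_component(unsigned int particleIndex, unsigned int angleBucket)
{
    return (unsigned short)(((particleIndex & 0x0FFF) << 4) | (angleBucket & 0x000F));
}

//Split a 16-bit component back into its 12-bit and 4-bit subparts.
void unpack_component(unsigned short component, unsigned int* particleIndex, unsigned int* angleBucket)
{
    *particleIndex = (component >> 4) & 0x0FFF; //e.g. 0x120A >> 4 = 0x120 = particle 288
    *angleBucket = component & 0x000F;          //e.g. 0x120A & 0xF = 0xA = angle 10
}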

Particles interact with between 0 and 9 other particles, hence the 4-22 bytes I mentioned above (although, rarely, a particle can interact with 12 or more others; there is no limit, and if all 500 particles are within the radius, the string will be 1004 bytes long).

Additional information: The hashing function and compare function in my actual code use the size stored in the most significant 12 bits of the second short to do processing, since non-terminal 0x0000s can appear in my state strings. That all works fine.
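
Roughly, the length-aware hashing looks like this (a sketch of the idea rather than my exact code; it assumes the shorts are stored little-endian and that the interaction count sits in the most significant 12 bits of the second component, as described above):

#include <stddef.h>

//Recover the string length in bytes from the interaction count packed
//into the most significant 12 bits of the second 16-bit component.
size_t state_string_length(const unsigned char* s)
{
    unsigned int second = (unsigned int)s[2] | ((unsigned int)s[3] << 8);
    unsigned int count = (second >> 4) & 0x0FFF;
    return 4 + 2 * (size_t)count; //2 header components + 2 bytes per interacting particle
}

//FNV-1 over the full length, so embedded 0x0000 bytes are hashed too.
size_t state_string_hash(const unsigned char* s)
{
    unsigned int hash = 2166136261u;
    size_t len = state_string_length(s);
    for (size_t i = 0; i < len; ++i)
        hash = (16777619u * hash) ^ s[i];
    return (size_t)hash;
}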

Upvotes: 4

Views: 293

Answers (1)

laune

Reputation: 31300

These figures are from experiments with gcc on Linux. Allocating short chunks of 4-22 bytes requires 16 bytes for lengths of 1-12, 24 bytes for lengths of 13-20, and 32 bytes for the rest.

This means that your experiment with the 18218975 strings ("0".."50000000") requires 291503600 bytes on the heap, with the sum of their lengths (plus trailing 0) being 156681483.

Thus you have 135MB overhead simply due to malloc.
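
A quick back-of-the-envelope check of those numbers (a sketch, assuming the 16-byte allocation granularity measured above, since every test string has at most 8 digits):

#include <stdio.h>

int main()
{
    unsigned long entries = 18218975UL;       //unique strings in the test
    unsigned long heapBytes = entries * 16UL; //each malloc of <= 12 bytes takes a 16-byte chunk
    unsigned long payloadBytes = 156681483UL; //sum of string lengths plus trailing 0
    printf("heap: %lu payload: %lu overhead: %lu\n",
           heapBytes, payloadBytes, heapBytes - payloadBytes);
    return 0; //prints heap: 291503600 payload: 156681483 overhead: 134822117
}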

(Is this Peak Working Set size a reliable figure?)

Upvotes: 1
