Reputation: 9144
I am looking for a string container suited to a large number of strings (> 10^9). Strings have variable length, with an average length of about 10 bytes. The container must be fast for insertion and lookup and have frugal memory use. Strings are unordered once the container is filled. Lookup is done on exact string value. Erasability is optional. N is unknown in advance. Target is a 64-bit architecture. Use case: think of an associative array for AWK.
map<string>
has about 20-40 bytes of overhead per string, and each insertion calls malloc once (or twice). So it is neither fast nor frugal.
Can someone point me to a C/C++ library, a data structure or a paper?
Relevant -- Comparison of Hash Table Libraries
EDIT: I've removed "big data", bumped N to a larger value, and clarified the requirements.
Upvotes: 0
Views: 299
Reputation: 70482
For your problem, a pointer on a 64-bit machine is nearly as long as your data. So using multiple pointers per string (average length under 10 bytes) would make the size of the data structure dominate the size of your input.
One general way to deal with that is to not use pointers to represent your strings. A specialized representation using a 32-bit offset into a large page where all your strings get stored would halve the pointer memory requirement, at the cost of an addition to a base pointer to retrieve each string.
Edit: Below is a sample (untested) implementation of such a representation (using struct for the sake of simplicity; an actual implementation would of course only make the user interface public). The representation assumes hash table insertion, hence it leaves room for a next_ field. Note that the offsets are scaled by sizeof(hash_node) to allow representation in a 32-bit offset.
#include <cstdint>
#include <cstring>
#include <vector>

struct hash_node {
    uint32_t next_;   /* offset of the next node in the bucket chain, 0 = end */
    /* the string is stored inline, immediately after the node header */
    char * str () { return (char *)(&next_ + 1); }
    const char * str () const { return (const char *)(&next_ + 1); }
};
struct hash_node_store {
    std::vector<hash_node> page_; /* holds the allocated memory for nodes */
    uint32_t free_;               /* offset of the next unused node slot  */

    hash_node_store () : free_(1) {}  /* offset 0 is reserved to mean "null" */

    hash_node * ptr (uint32_t offset) {
        if (offset == 0) return 0;
        return &page_[offset-1];
    }

    uint32_t allocate (const char *str) {
        uint32_t len = strlen(str) + 1;
        /* slots needed: one header slot plus the string, rounded up */
        uint32_t node_size =
            1 + (len / sizeof(hash_node)) + !!(len % sizeof(hash_node));
        if (page_.size() < free_ - 1 + node_size)
            page_.resize(free_ - 1 + node_size);  /* grow the page as needed */
        hash_node *hn = ptr(free_);
        strcpy(hn->str(), str);
        free_ += node_size;
        return 1 + (hn - &page_[0]);
    }
};
A hash table would contain a node store, and a hash bucket vector.
struct hash_table {
    hash_node_store store_;
    std::vector<uint32_t> table_; /* holds allocated memory for buckets */

    /* buckets must be pre-sized and zero-filled (0 = empty bucket) */
    explicit hash_table (uint32_t n_buckets) : table_(n_buckets, 0) {}

    uint32_t hash_func (const char *s) { /* ... */ }
    uint32_t find_at (uint32_t node_offset, const char *str);
    bool insert_at (uint32_t &node_offset, const char *str);

    bool insert (const char *str) {
        uint32_t bucket = hash_func(str) % table_.size();
        return insert_at(table_[bucket], str);
    }
    bool find (const char *str) {
        uint32_t bucket = hash_func(str) % table_.size();
        return find_at(table_[bucket], str) != 0;
    }
};
Here find_at and insert_at are just simple functions implemented in the expected way.
/* returns the offset of the node holding str, or 0 if it is not present */
uint32_t hash_table::find_at (uint32_t node_offset, const char *str) {
    hash_node *hn = store_.ptr(node_offset);
    while (hn) {
        if (strcmp(hn->str(), str) == 0) break;
        node_offset = hn->next_;
        hn = store_.ptr(node_offset);
    }
    return node_offset;
}
/* inserts str at the head of the bucket chain; returns true if newly inserted */
bool hash_table::insert_at (uint32_t &node_offset, const char *str) {
    if (! find_at(node_offset, str)) {
        uint32_t new_node = store_.allocate(str);
        store_.ptr(new_node)->next_ = node_offset;  /* link to the old chain head */
        node_offset = new_node;                     /* bucket now points at the new node */
        return true;
    }
    return false;
}
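To round things out, a minimal usage sketch follows. The FNV-1a routine and the bucket count are illustrative assumptions; the original leaves hash_func elided, and its body would need to be filled in (e.g. by delegating to the function below) before this actually runs.

/* Illustrative 32-bit FNV-1a hash -- one possible body for the elided
   hash_func above (an assumption, not part of the original sketch). */
uint32_t fnv1a_32 (const char *s) {
    uint32_t h = 2166136261u;
    for (; *s; ++s) { h ^= (unsigned char)*s; h *= 16777619u; }
    return h;
}

int main () {
    hash_table ht(1u << 20);        /* bucket count chosen arbitrarily here */
    ht.insert("apple");
    ht.insert("banana");
    bool dup = ht.insert("apple");  /* false: already present */
    bool hit = ht.find("banana");   /* true */
    bool miss = ht.find("cherry");  /* false */
    return (hit && !miss && !dup) ? 0 : 1;
}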
Upvotes: 2
Reputation: 1627
Is a (low) false positive rate acceptable? If so, then Bloom filters would be an option. If you'd be content with a false positive rate of one in a million, or 2^(-20), you'd want to use a buffer size in bits of around 30 times the number of strings you expect, or 3*10^10 bits. That's less than 4GB. You'd also need around 20 independent hash functions.
If you can't accept false positives, you should consider putting a Bloom filter in front of whatever other solution you build, to weed out most negatives really quickly.
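For illustration, a minimal sketch of such a filter sized per the figures above; the double-hashing trick used to derive the k probe positions (and the FNV-1a base hashes) are assumptions made here for brevity, not something the numbers above prescribe.

#include <cstdint>
#include <vector>

/* Minimal Bloom filter sketch: k probe positions derived from two base
   hashes via double hashing (h1 + i*h2). */
struct bloom_filter {
    std::vector<uint64_t> bits_;
    uint64_t nbits_;
    unsigned k_;

    bloom_filter (uint64_t nbits, unsigned k)
        : bits_((nbits + 63) / 64, 0), nbits_(nbits), k_(k) {}

    static uint64_t fnv1a (const char *s, uint64_t seed) {
        uint64_t h = 14695981039346656037ull ^ seed;
        for (; *s; ++s) { h ^= (unsigned char)*s; h *= 1099511628211ull; }
        return h;
    }

    void insert (const char *s) {
        uint64_t h1 = fnv1a(s, 0), h2 = fnv1a(s, 0x9e3779b97f4a7c15ull) | 1;
        for (unsigned i = 0; i < k_; ++i) {
            uint64_t pos = (h1 + i * h2) % nbits_;
            bits_[pos / 64] |= 1ull << (pos % 64);
        }
    }

    bool maybe_contains (const char *s) const {
        uint64_t h1 = fnv1a(s, 0), h2 = fnv1a(s, 0x9e3779b97f4a7c15ull) | 1;
        for (unsigned i = 0; i < k_; ++i) {
            uint64_t pos = (h1 + i * h2) % nbits_;
            if (!(bits_[pos / 64] & (1ull << (pos % 64))))
                return false;   /* definitely not in the set */
        }
        return true;            /* in the set, or a false positive */
    }
};

/* e.g. bloom_filter bf(30ull * 1000000000ull, 20); -- ~3.75 GB for 10^9 keys */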
Upvotes: 1
Reputation: 106226
As you're only inserting values, the string data itself can be concatenated as it's inserted, each string followed by a delimiter character such as NUL. The character offset into that single buffer uniquely identifies the string. This means strings that share a common substring are each stored in full, redundantly; but in return no effort is spent trying to find or encode such sharing, an effort that could backfire for highly unrelated string values (e.g. random text).
To find the strings, a hash table could be used. Given your aim of avoiding frequent dynamic memory allocations, collisions can be handled efficiently with displacement lists: when inserting a string that hashes to an already used bucket, you add an offset (wrapping around the table if necessary) and try that other bucket, continuing until an empty bucket is found. This means you need a list of displacements to try: you can hand-code a finite list (or lists) to get started, potentially even nesting loops over a "big displacement" list whose values are added to those from a "small displacement" list until an empty bucket is found; e.g. two hand-coded lists of 10 displacements yield 100 combinations. (Alternative hashing algorithms can be used instead of, or combined with, displacement lists.) You do need a reasonable ratio of total to used buckets though; I'd expect something around 1.2 to work OK typically, with larger values prioritising speed over space. You could populate your system with sample data and tune to taste.
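A rough sketch of that combination, assuming an FNV-1a hash, a small hand-coded displacement list, and a load factor kept below 1 so probing always terminates; none of these specifics are prescribed by the description above.

#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

/* Sketch: all strings concatenated NUL-delimited in one buffer, an
   open-addressed bucket array of byte offsets, and collisions resolved by a
   hand-coded displacement list. */
struct string_set {
    std::string data_;                /* concatenated string data, NUL-delimited */
    std::vector<uint64_t> buckets_;   /* offset into data_ plus 1; 0 means empty */

    explicit string_set (size_t n_buckets) : buckets_(n_buckets, 0) {}

    static uint64_t hash (const char *s) {          /* FNV-1a, 64-bit */
        uint64_t h = 14695981039346656037ull;
        for (; *s; ++s) { h ^= (unsigned char)*s; h *= 1099511628211ull; }
        return h;
    }

    /* Probe sequence: a small hand-coded displacement list, shifted by one
       bucket per round so that every bucket is eventually reachable. */
    size_t probe (size_t base, size_t i) const {
        static const size_t disp[] = {0, 1, 3, 7, 12, 18, 27, 43, 66, 97};
        return (base + i / 10 + disp[i % 10]) % buckets_.size();
    }

    /* Returns true if str was newly inserted, false if already present.
       Assumes the table never becomes completely full (keep ~1.2x buckets). */
    bool insert (const char *str) {
        size_t base = hash(str) % buckets_.size();
        for (size_t i = 0; ; ++i) {
            size_t b = probe(base, i);
            if (buckets_[b] == 0) {                  /* empty: claim this bucket */
                buckets_[b] = data_.size() + 1;
                data_.append(str);
                data_.push_back('\0');
                return true;
            }
            if (strcmp(data_.c_str() + (buckets_[b] - 1), str) == 0)
                return false;                        /* duplicate */
        }
    }

    bool find (const char *str) const {
        size_t base = hash(str) % buckets_.size();
        for (size_t i = 0; ; ++i) {
            size_t b = probe(base, i);
            if (buckets_[b] == 0) return false;      /* first empty bucket: absent */
            if (strcmp(data_.c_str() + (buckets_[b] - 1), str) == 0)
                return true;
        }
    }
};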
So, the space requirement is:
total_bytes_of_string_data + N delimiters + total_buckets * sizeof(string_offset)
Where sizeof(string_offset) probably needs 8 bytes as 10^9 * 10 is already more than 2^32.
For 10^9 strings of ~10 characters and 1.2*10^9 buckets, this is around 10^9 * (10+1) + 1.2*10^9 * 8 bytes = 20.6*10^9 bytes, or roughly 19.2 GiB.
It's worth noting that a 64-bit virtual address space means you can safely allocate much more space for the concatenated string data and hash table than you actually need; only the pages actually touched require backing memory (initially physical memory, though it could later be swapped to disk through the normal virtual-memory mechanisms).
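On Linux, for instance, such an over-sized reservation might look like the following (the size and flags here are illustrative assumptions):

#include <sys/mman.h>

/* Reserve far more virtual address space than we expect to need; pages are
   only backed by physical memory (or swap) once they are actually touched. */
char *string_data = (char *) mmap(nullptr, 64ull << 30,      /* 64 GiB */
                                  PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
                                  -1, 0);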
Discussion
There's no way to guarantee reduction in the string memory usage without assumptions / insights about repetitions in the string data or character set used.
If all the insertions were followed by a huge number of searches, sorting the string data and using binary searches would be an ideal solution. But, for quick insertions interspersed with searches the above is reasonable.
You could also have an index based around balanced binary trees, but to avoid memory allocations for each insertion you'd need to group a lot of nodes into one memory page and manually manage the ordering and splitting thereof on a less granular level: painful to implement. There might be a library doing it already but I haven't heard of one.
You've added "associative arrays in AWK" as an example of what this could be used for. You could simply embed each mapped-to value immediately after its string key in the concatenated data.
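A sketch of that record layout (the helper names here are made up for illustration): each record is stored as key, NUL, value, NUL; the hash bucket keeps the key's offset, and the value starts right after the key's terminator.

#include <cstring>
#include <string>

/* Append key '\0' value '\0' to the concatenated buffer and return the key's
   offset -- the value a hash bucket would store. */
size_t append_record (std::string &data, const char *key, const char *value) {
    size_t key_offset = data.size();
    data.append(key);   data.push_back('\0');
    data.append(value); data.push_back('\0');
    return key_offset;
}

const char * value_of (const std::string &data, size_t key_offset) {
    const char *key = data.c_str() + key_offset;
    return key + strlen(key) + 1;      /* value starts right after the key's NUL */
}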
Upvotes: 1
Reputation: 178491
There is no silver bullet, but a radix tree gives the advantages of a trie (fast lookup and insert, at least asymptotically) with better space consumption.
However, neither is considered "cache efficient", which might be significant, especially if iteration over the data is required at some point.
Upvotes: 6