Reputation: 149

Hashing with large data sets and C implementation

I have a large number of values ranging from 0 - 5463458053. To each value, I wish to map a set containing strings so that the operation lookup, i. e. finding whether a string is present in that set takes the least amount of time. Note that this set of values may not contain all values from (0 - 5463458053), but yes, a large number of them.

My current solution is to hash those values (between 0 - 5463458053) and for each value, have a linked list of strings corresponding to that value. Every time, I want to check for a string in a given set, I hash the value(between 0 - 5463458053), get the linked list, and traverse it to find out whether it contains the aforementioned string or not.

While this might seem easier, it's a little time consuming. Can you think of a faster solution? Also, collisions will be dreadful. They'll lead to wrong results.

The other part is about implementing this in C. How would I go about doing this?

NOTE: Someone suggested using a database instead. I wonder if that'll be useful.

I'm a little worried about running out of RAM naturally. :-)

Upvotes: 1

Answers (5)

Ambroz Bizjak

Reputation: 8105

You can use a single binary search tree (AVL/Red-black/...) to contain all the strings, from all sets, by keying them lexicographically as (set_number, string). You don't need to store sets explicitly anywhere. For example, the comparator defining the order of nodes for the tree could look like:

function compare_nodes (node1, node2) {
    if (node1.set_number < node2.set_number) return LESS;
    if (node1.set_number > node2.set_number) return GREATER;
    if (node1.string < node2.string) return LESS;
    if (node1.string > node2.string) return GREATER;
    return EQUAL;
}

With such a structure, some common operations are possible (but maybe not straightforward).

To find whether a string s exists in the set set_number, simply lookup (set_number, s) in the tree, for an exact match.

To find all strings in the set set_number:

function iterate_all_strings_in_set (set_number) {
    // Traverse the tree from root downwards, looking for the given key. Return
    // wherever the search ends up, whether it found the value or not.
    node = lookup_tree_weak(set_number, "");

    // tree empty?
    if (node == null) {
        return;
    }

    // We may have gotten the greatest node from the previous set,
    // instead of the first node from the set we're interested in.
    if (node.set_number != set_number) {
        node = successor(node);
    }

    while (node != null && node.set_number == set_number) {
        do_something_with(node.string);
        node = successor(node);
    }
}

The above requires O((k+1)*log(n)) time, where k is the number of strings in set_number, and n is the number of all strings.

To find all set numbers with at least one string associated:

function iterate_all_sets ()
{
    node = first_node_in_tree();

    while (node != null) {
        current_set = node.set_number;
        do_something_with(current_set);

        if (cannot increment current_set) {
            return;
        }

        node = lookup_tree_weak(current_set + 1, "");
        if (node.set_number == current_set) {
            node = successor(node);
        }
    }
}

The above requires O((k+1)*log(n)) time, where k is the number of sets with at least one string, and n is the number of all strings.

Note that the above code assumes that the tree is not modified in the "do_something" calls; it may crash if nodes are removed.

Addidionally, here's some real C code which demonstrates this, using my own generic AVL tree implemetation. To compile it, it's enough to copy the misc/ and structure/ folders from BadVPN source somewhere and add an include path there.

Note how my AVL tree does not contain any "data" in its nodes, and how it doesn't do any of its own memory allocation. This comes handy when you have a lot of data to work with. To make it clear: the program below does only a single malloc(), which is the one that allocates the nodes array.

#include <stdlib.h>
#include <stdio.h>
#include <inttypes.h>
#include <assert.h>

#include <structure/BAVL.h>
#include <misc/offset.h>

struct value {
    uint32_t set_no;
    char str[3];
};

struct node {
    uint8_t is_used;
    struct value val;
    BAVLNode tree_node;
};

BAVL tree;

static int value_comparator (void *unused, void *vv1, void *vv2)
{
    struct value *v1 = vv1;
    struct value *v2 = vv2;

    if (v1->set_no < v2->set_no) {
        return -1;
    }
    if (v1->set_no > v2->set_no) {
        return 1;
    }

    int c = strcmp(v1->str, v2->str);
    if (c < 0) {
        return -1;
    }
    if (c > 0) {
        return 1;
    }
    return 0;
}

static void random_bytes (unsigned char *out, size_t n)
{
    while (n > 0) {
        *out = rand();
        out++;
        n--;
    }
}

static void random_value (struct value *out)
{
    random_bytes((unsigned char *)&out->set_no, sizeof(out->set_no));

    for (size_t i = 0; i < sizeof(out->str) - 1; i++) {
        out->str[i] = (uint8_t)32 + (rand() % 94);
    }
    out->str[sizeof(out->str) - 1] = '\0';
}

static struct node * find_node (const struct value *val)
{
    // find AVL tree node with an equal value
    BAVLNode *tn = BAVL_LookupExact(&tree, (void *)val);
    if (!tn) {
        return NULL;
    }

    // get node pointer from pointer to its value (same as container_of() in Linux kernel)
    struct node *n = UPPER_OBJECT(tn, struct node, tree_node);
    assert(n->val.set_no == val->set_no);
    assert(!strcmp(n->val.str, val->str));

    return n;
}

static struct node * lookup_weak (const struct value *v)
{
    BAVLNode *tn = BAVL_Lookup(&tree, (void *)v);
    if (!tn) {
        return NULL;
    }

    return UPPER_OBJECT(tn, struct node, tree_node);
}

static struct node * first_node (void)
{
    BAVLNode *tn = BAVL_GetFirst(&tree);
    if (!tn) {
        return NULL;
    }

    return UPPER_OBJECT(tn, struct node, tree_node);
}

static struct node * next_node (struct node *node)
{
    BAVLNode *tn = BAVL_GetNext(&tree, &node->tree_node);
    if (!tn) {
        return NULL;
    }

    return UPPER_OBJECT(tn, struct node, tree_node);
}

size_t num_found;

static void iterate_all_strings_in_set (uint32_t set_no)
{
    struct value v;
    v.set_no = set_no;
    v.str[0] = '\0';
    struct node *n = lookup_weak(&v);

    if (!n) {
        return;
    }

    if (n->val.set_no != set_no) {
        n = next_node(n);
    }

    while (n && n->val.set_no == set_no) {
        num_found++; // "do_something_with_string"
        n = next_node(n);
    }
}

static void iterate_all_sets (void)
{
    struct node *node = first_node();

    while (node) {
        uint32_t current_set = node->val.set_no;
        iterate_all_strings_in_set(current_set); // "do_something_with_set"

        if (current_set == UINT32_MAX) {
            return;
        }

        struct value v;
        v.set_no = current_set + 1;
        v.str[0] = '\0';
        node = lookup_weak(&v);

        if (node->val.set_no == current_set) {
            node = next_node(node);
        }
    }
}

int main (int argc, char *argv[])
{
    size_t num_nodes = 10000000;

    // init AVL tree, using:
    //   key=(struct node).val,
    //   comparator=value_comparator
    BAVL_Init(&tree, OFFSET_DIFF(struct node, val, tree_node), value_comparator, NULL);

    printf("Allocating...\n");

    // allocate nodes (missing overflow check...)
    struct node *nodes = malloc(num_nodes * sizeof(nodes[0]));
    if (!nodes) {
        printf("malloc failed!\n");
        return 1;
    }

    printf("Inserting %zu nodes...\n", num_nodes);

    size_t num_inserted = 0;

    // insert nodes, giving them random values
    for (size_t i = 0; i < num_nodes; i++) {
        struct node *n = &nodes[i];

        // choose random set number and string
        random_value(&n->val);

        // try inserting into AVL tree
        if (!BAVL_Insert(&tree, &n->tree_node, NULL)) {
            printf("Insert collision: (%"PRIu32", '%s') already exists!\n", n->val.set_no, n->val.str);
            n->is_used = 0;
            continue;
        }

        n->is_used = 1;
        num_inserted++;
    }

    printf("Looking up...\n");

    // lookup all those values
    for (size_t i = 0; i < num_nodes; i++) {
        struct node *n = &nodes[i];
        struct node *lookup_n = find_node(&n->val);

        if (n->is_used) { // this node is the only one with this value

            ASSERT(lookup_n == n)
        } else { // this node was an insert collision; some other
                 // node must have this value
            ASSERT(lookup_n != NULL) 
            ASSERT(lookup_n != n)
        }
    }

    printf("Iterating by sets...\n");

    num_found = 0;
    iterate_all_sets();
    ASSERT(num_found == num_inserted)

    printf("Removing all strings...\n");

    for (size_t i = 0; i < num_nodes; i++) {
        struct node *n = &nodes[i];
        if (!n->is_used) { // must not remove it it wasn't inserted
            continue;
        }
        BAVL_Remove(&tree, &n->tree_node);
    }

    return 0;
}

Upvotes: 2

Damon

Reputation: 70206

Large number of strings + fast lookup + limited memory ----> you want a prefix trie, crit-bit tree, or anything of that family (many different names for very similar things, e.g. PATRICIA... Judy is one such thing too). See for example this.

These data structores allow for prefix-compression, so they are able to store a lot of strings (which somehow necessarily will have common prefixes) very efficiently. Also, lookup is very fast. Due to caching and paging effects that the common big-O notation does not account for, they can be as fast or even faster than a hash, at a fraction of the memory (even though according to big-O, nothing except maybe an array can beat a hash).

Upvotes: 2

wildplasser

Reputation: 44250

If the entries are from 0 to N and consecutive: use an array. (Is indexing fast enough for you?)

EDIT: the numbers do not seem to be consecutive. There is a large number of {key,value} pairs, where the key is a big number (>32 bits but < 64 bits) and the value is a bunch of strings.

If memory is available, a hash table is easy, if the bunch of strings is not too large you can inspect them sequentially. If the same strings occur (much) more than once, you could enumerate the strings (put pointers to them in a char * array[] and use the index into that array instead. finding the index given a string probably involves another hash table)

For the "master" hashtable an entry would probably be:

struct entry {
  struct entry *next; /* for overflow chain */
  unsigned long long key; /* the 33bits number */
  struct list *payload;
  } entries[big_enough_for_all] ; /* if size is known in advance
                          , preallocation avoids a lot of malloc overhead */

if you have enough memory to store a heads-array, you chould certainly do that:

struct entry *heads[SOME_SIZE] = {NULL, };

, otherwise you can combine the heads array with the array of entries. (like I did Lookups on known set of integer keys here)

Handling collisions is easy: as you walk the overflow chain, just compare your key with the key in the entry. If they are unequal: walk on. If they are equal: found; now go walking the strings.

Upvotes: 1

Dan Aloni

Reputation: 4108

A Judy Array, with the C library that implements it, might be exactly the base of what you need. Here's a quote that describes it:

Judy is a C library that provides a state-of-the-art core technology that implements a sparse dynamic array. Judy arrays are declared simply with a null pointer. A Judy array consumes memory only when it is populated, yet can grow to take advantage of all available memory if desired. Judy's key benefits are scalability, high performance, and memory efficiency. A Judy array is extensible and can scale up to a very large number of elements, bounded only by machine memory. Since Judy is designed as an unbounded array, the size of a Judy array is not pre-allocated but grows and shrinks dynamically with the array population. Judy combines scalability with ease of use. The Judy API is accessed with simple insert, retrieve, and delete calls that do not require extensive programming. Tuning and configuring are not required (in fact not even possible). In addition, sort, search, count, and sequential access capabilities are built into Judy's design.

Judy can be used whenever a developer needs dynamically sized arrays, associative arrays or a simple-to-use interface that requires no rework for expansion or contraction.

Judy can replace many common data structures, such as arrays, sparse arrays, hash tables, B-trees, binary trees, linear lists, skiplists, other sort and search algorithms, and counting functions.

Upvotes: 2

Basile Starynkevitch

Reputation: 1

You could have an hash-table of hash-sets. The first hash-table has keys your integers. The values inside it are hash-sets, i.e. hash-tables whose keys are strings.

You could also have an hashed set, with the keys being pairs of integers and strings.

There are many libraries implementing such data structures (and in C++, the standard library is implementing them, as std::map & std::set). For C, I was thinking of Glib from GTK.

With hashing techniques, memory use is proportional to the size of the considered sets (or relations). For instance, you could accept 30% emptiness rate.

Upvotes: 3

Hashing with large data sets and C implementation

Answers (5)

Related Questions