prosseek

Reputation: 190719

Size-efficient dictionary (associative array) implementation

What algorithms are available for a size-efficient dictionary or associative array? For example, given this key/value set, how can one avoid duplicating "Alice" in the values?

{
    "Pride and Prejudice": "Alice",
    "The Brothers Karamazov": "Pat",
    "Wuthering Heights": "Alice"
}

I checked Python's dictionary implementation, but it seems to be focused on speed (keeping O(1) lookups) rather than size.
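To see the duplication concretely (a minimal check; the strings are built at run time so CPython's compile-time constant sharing does not hide the effect):

```python
# Two equal value strings created at run time are distinct objects,
# so each copy occupies its own memory.
v1 = "".join(["Al", "ice"])
v2 = "".join(["Al", "ice"])
assert v1 == v2        # equal contents
assert v1 is not v2    # but two separate allocations
```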

Upvotes: 4

Views: 239

Answers (3)

schesis

Reputation: 59118

As mentioned by bennofs in comments, you could use intern() (sys.intern in Python 3) to ensure that identical strings are stored only once:

from sys import intern  # in Python 2, intern() is a builtin

class InternDict(dict):

    def __setitem__(self, key, value):
        if isinstance(value, str):
            super(InternDict, self).__setitem__(key, intern(value))
        else:
            super(InternDict, self).__setitem__(key, value)

Here's an example of the effect that has:

>>> d = {}
>>> d["a"] = "This string is presumably too long to be auto-interned."
>>> d["b"] = "This string is presumably too long to be auto-interned."
>>> d["a"] is d["b"]
False
>>> di = InternDict()
>>> di["a"] = "This string is presumably too long to be auto-interned."
>>> di["b"] = "This string is presumably too long to be auto-interned."
>>> di["a"] is di["b"]
True
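For reference, the same effect can be obtained without the dict subclass by interning values directly at insertion time (a minimal sketch using sys.intern from the standard library):

```python
from sys import intern  # builtin intern() in Python 2

# Interning equal strings yields the same canonical object, so only one
# copy is kept in memory; the strings are built at run time to rule out
# compile-time constant sharing.
a = intern("".join(["long value ", "built at run time"]))
b = intern("".join(["long value ", "built at run time"]))
assert a is b  # both names refer to the single interned copy
```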

Upvotes: 1

Galigator

Reputation: 10711

  • If your dictionary can fit in memory, a simple hash table can be used.

Try to insert every key/value pair into a hash table. If the key already exists before inserting, you have found a duplicate. There are many hash table implementations in many languages.

There are basically two approaches: arrays and trees.

  • Arrays focus on speed at a high memory cost. The main difference between hash table implementations is their behavior regarding uniqueness: some implementations enforce unique keys, others do not.

  • Trees focus on smart memory usage at the cost of O(log(n)) CPU time per operation. The g++ std::map relies on a very powerful red-black tree.

If size is really problematic, then you should look into Huffman and/or Lempel-Ziv compression, but it costs a little more effort to adapt them to a dictionary.
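Short of full compression, a lighter option is to deduplicate the values themselves. The sketch below (hypothetical class and attribute names, purely illustrative) stores each distinct value once in a side table and keeps only small slot indices per key, trading one lookup indirection for reduced duplication:

```python
# Illustrative sketch: keep one copy of each distinct value and map
# keys to slots in the shared value table.
class DedupDict:
    def __init__(self):
        self._index = {}    # value -> slot in the value table
        self._values = []   # one copy of each distinct value
        self._slots = {}    # key -> slot

    def __setitem__(self, key, value):
        slot = self._index.get(value)
        if slot is None:                  # first time we see this value
            slot = len(self._values)
            self._values.append(value)
            self._index[value] = slot
        self._slots[key] = slot

    def __getitem__(self, key):
        return self._values[self._slots[key]]

d = DedupDict()
d["Pride and Prejudice"] = "Alice"
d["Wuthering Heights"] = "Alice"
assert d["Wuthering Heights"] == "Alice"
assert len(d._values) == 1  # "Alice" is stored only once
```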

  • If your dictionary can't fit in memory

You should look at databases. The database counterpart of the red-black tree is (roughly) the B-tree. It has branching-factor optimizations to reduce latency on slow hard drives.

I have put many links to Wikipedia, but if you like this subject I recommend:

Introduction to Algorithms

Upvotes: 0

Zim-Zam O'Pootertoot

Reputation: 18148

One way to improve space efficiency (in addition to sharing values, which, as bennofs points out in the comments, you can probably accomplish efficiently by using sys.intern) is to use hopscotch hashing, an open addressing scheme (a variant of linear probing) for resolving collisions. Closed addressing schemes use more space because you need to allocate a linked list for each bucket, whereas with an open addressing scheme you just use an open adjacent slot in the backing array without allocating any linked lists. Unlike other open addressing schemes (such as cuckoo hashing or vanilla linear probing), hopscotch hashing performs well under a high load factor (over 90%) and guarantees constant time lookups.
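To make the open-addressing idea concrete, here is a toy table using vanilla linear probing, not the full hopscotch scheme; it is a simplified sketch (deletion and resizing omitted) whose only point is that collisions are resolved inside the backing array itself, with no per-bucket linked lists:

```python
# Toy open-addressing table with vanilla linear probing.
# Not production code: no deletion, no resizing, so it must never fill up.
class LinearProbingDict:
    def __init__(self, capacity=8):
        self._keys = [None] * capacity
        self._vals = [None] * capacity

    def _probe(self, key):
        # Walk forward from the hash slot until we hit this key
        # or an empty slot.
        i = hash(key) % len(self._keys)
        while self._keys[i] is not None and self._keys[i] != key:
            i = (i + 1) % len(self._keys)
        return i

    def __setitem__(self, key, value):
        i = self._probe(key)
        self._keys[i] = key
        self._vals[i] = value

    def __getitem__(self, key):
        i = self._probe(key)
        if self._keys[i] is None:
            raise KeyError(key)
        return self._vals[i]

t = LinearProbingDict()
t["Pride and Prejudice"] = "Alice"
t["Wuthering Heights"] = "Alice"
assert t["Wuthering Heights"] == "Alice"
```

Hopscotch hashing refines this by keeping every key within a small fixed "neighborhood" of its home slot, which is what preserves constant-time lookups even at high load factors.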

Upvotes: 1
