Christian P.
Christian P.

Reputation: 4884

Ideal data structure with fast lookup, fast update and easy comparison/sorting

I am looking for a good data structure to contain a list of tuples with (hash, timestamp) values. Basically, I want to use it in the following way:

Periodically, I wish to remove and return a list of tuples that older than a specific timestamp (I need to update various other elements when they 'expire'). Timestamp does not have to be anything specific (it can be a unix timestamp, a python datetime object, or some other easy-to-compare hash/string).

I am using this to receive incoming data, update it if it's already present and purge data older than X seconds/minutes.

Multiple data structures can be a valid suggestion as well (I originally went with a priority queue + set, but a priority queue is less-than-optimal for constantly updating values).

Other approaches to achieve the same thing are welcome as well. The end goal is to track when elements are a) new to the system, b) exist in the system already and c) when they expire.

Upvotes: 6

Views: 3689

Answers (5)

AShelly
AShelly

Reputation: 35540

A simple hashtable or dictionary will be O(1) for the check/update/set operations. You could simultaneously store the data in a simple time-ordered list where for the purge operations. Keep a head and tail pointer, so that insert is also O(1) and removal is as simple as advancing the head until it reaches the target time and removing all the entries you find from the hash.

The overhead is one extra pointer per stored data item, and the code is dead simple:

insert(key,time,data):
  existing = MyDictionary.find(key)
  if existing:  
      existing.mark()
  node = MyNodeType(data,time)  #simple container holding args + 'next' pointer
  node.next = NULL
  MyDictionary.insert(key,node)
  Tail.next = node
  if Head is NULL:  Head = node

clean(olderThan):
  while Head.time < olderThan:
    n = Head.next 
    if not Head.isMarked():  
        MyDictionary.remove(Head.key)
    #else it was already overwritten
    if Head == Tail: Tail = n
    Head = n

Upvotes: 0

John Y
John Y

Reputation: 14529

Unless I'm misreading your question, a plain old dict should be ideal for everything except the purging. Assuming you are trying to avoid having to inspect the entire dictionary during purging, I would suggest keeping around a second data structure to hold (timestamp, hash) pairs.

This supplemental data structure could either be a plain old list or a deque (from the collections module). Possibly the bisect module could be handy to keep the number of timestamp comparisons down to a minimum (as opposed to comparing all the timestamps until you reach the cut-off value), but since you'd still have to iterate sequentially over the items that need to be purged, ironing out the exact details of what would be quickest requires some testing.

Edit:

For Python 2.7 or 3.1+, you could also consider using OrderedDict (from the collections module). This is basically a dict with a supplementary order-preserving data structure built into the class, so you don't have to implement it yourself. The only hitch is that the only order it preserves is insertion order, so that for your purpose, instead of just reassigning an existing entry to a new timestamp, you'll need to remove it (with del) and then assign a fresh entry with the new timestamp. Still, it retains the O(1) lookup and saves you from having to maintain the list of (timestamp, hash) pairs yourself; when it comes time to purge, you can just iterate straight through the OrderedDict, deleting entries until you reach one with a timestamp that is later than your cut-off.

Upvotes: 2

SingleNegationElimination
SingleNegationElimination

Reputation: 156158

This is a pretty well trod space. The thing you need is two structures, You need something to tell you wether your key (hash in your case) is known to the collection. For this, dict is a very good fit; we'll just map the hash to the timestamp so you can look up each item easily. Iterating over the items in order of timestamp is a task particularly suited to Heaps, which are provided by the heapq module. Each time we see a key, we'll just add it to our heap, as a tuple of (timestamp, hash).

Unfortunately there's no way to look into a heapified list and remove certain items (because, say, they have been updated to expire later). We'll get around that by just ignoring entries in the heap that have timestamps that are dissimilar from the value in the dict.

So here's a place to start, you can probably add methods to the wrapper class to support additional operations, or change the way data is stored:

import heapq


class ExpiringCache(object):
    def __init__(self):
        self._dict = {}
        self._heap = []

    def add(self, key, expiry):
        self._dict[key] = expiry
        heapq.heappush(self._heap, (expiry, key))

    def contains(self, key):
        return key in self._dict

    def collect(self, maxage):
        while self._heap and self._heap[0][0] <= maxage:
            expiry, key = heapq.heappop(self._heap)
            if self._dict.get(key) == expiry:
                del self._dict[key]

    def items(self):
        return self._dict.items()

create a cache and add some items

>>> xc = ExpiringCache()
>>> xc.add('apples', 1)
>>> xc.add('bananas', 2)
>>> xc.add('mangoes', 3)

re-add an item with an even later expiry

>>> xc.add('apples', 4)

collect everything "older" than two time units

>>> xc.collect(2)    
>>> xc.contains('apples')
True
>>> xc.contains('bananas')
False

Upvotes: 4

Useless
Useless

Reputation: 67743

The closest I can think of to a single structure with the properties you want is a splay tree (with your hash as the key).

By rotating recently-accessed (and hence updated) nodes to the root, you should end up with the least recently-accessed (and hence updated) data at the leaves or grouped in a right subtree.

Figuring out the details (and implementing them) is left as an exercise for the reader ...


Caveats:

  • worst case height - and therefore complexity - is linear. This shouldn't occur with a decent hash
  • any read-only operations (ie, lookups that don't update the timestamp) will disrupt the relationship between splay tree layout and timestamp

A simpler approach is to store an object containing (hash, timestamp, prev, next) in a regular dict, using prev and next to keep an up-to-date doubly-linked list. Then all you need alongside the dict are head and tail references.

Insert & update are still constant time (hash lookup + linked-list splice), and walking backwards from the tail of the list collecting the oldest hashes is linear.

Upvotes: 3

Cameron Sparr
Cameron Sparr

Reputation: 3981

If you're okay with working around occasional false positives, I think that a bloom filter may suit your needs well (it's very very fast)

http://en.wikipedia.org/wiki/Bloom_filter

and a python implementation: https://github.com/axiak/pybloomfiltermmap

EDIT: reading your post again, I think this will work, but instead of storing the hashes, just let the bloomfilter create the hashes for you. ie, I think you just want to use the bloomfilter as a set of timestamps. I'm assuming that your timestamps could basically just be a set since you are hashing them.

Upvotes: 1

Related Questions