Reputation: 29879
There are some data structures around that are really useful but are unknown to most programmers. Which ones are they?
Everybody knows about linked lists, binary trees, and hash tables, but what about skip lists and Bloom filters, for example? I would like to know more data structures that are not so common but are worth knowing because they rely on great ideas and enrich a programmer's toolbox.
PS: I am also interested in techniques like Dancing links which make clever use of properties of a common data structure.
EDIT: Please try to include links to pages describing the data structures in more detail. Also, try to add a couple of words on why a data structure is cool (as Jonas Kölker already pointed out). Also, try to provide one data-structure per answer. This will allow the better data structures to float to the top based on their votes alone.
Upvotes: 792
Views: 391071
Reputation:
Getting away from all these graph structures, I just love the simple Ring-Buffer.
When properly implemented you can seriously reduce your memory footprint while maintaining performance and sometimes even improving it.
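A minimal sketch of the idea in Python (the class and method names are just illustrative): a fixed-size array plus a head index and a count, with the write position wrapping around and overwriting the oldest entry when full.

```python
class RingBuffer:
    """Fixed-capacity FIFO backed by one preallocated array."""

    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = 0   # index of the oldest element
        self.size = 0

    def push(self, item):
        # When full, the oldest element is overwritten.
        tail = (self.head + self.size) % len(self.buf)
        self.buf[tail] = item
        if self.size < len(self.buf):
            self.size += 1
        else:
            self.head = (self.head + 1) % len(self.buf)

    def pop(self):
        if self.size == 0:
            raise IndexError("pop from empty ring buffer")
        item = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        self.size -= 1
        return item
```

Because the storage is one contiguous block allocated once, there is no per-element allocation or pointer overhead, which is where the memory savings come from.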
Upvotes: 4
Reputation: 126042
The Region Quadtree
(quoted from Wikipedia)
The region quadtree represents a partition of space in two dimensions by decomposing the region into four equal quadrants, subquadrants, and so on with each leaf node containing data corresponding to a specific subregion. Each node in the tree either has exactly four children, or has no children (a leaf node).
Quadtrees like this are good for storing spatial data, e.g. latitude and longitude or other types of coordinates.
This was by far my favorite data structure in college. Coding this guy and seeing it work was pretty cool. I highly recommend it if you're looking for a project that will take some thought and is a little off the beaten path. Anyway, it's a lot more fun than the standard BST derivatives that you're usually assigned in your data structures class!
In fact, as a bonus, I've found the notes from the lecture leading up to the class project (from Virginia Tech) here (pdf warning).
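To make the quadrant-splitting idea concrete, here is a toy point-region quadtree sketch in Python (the capacity, class, and method names are my own illustrative choices, not from any particular course material): a leaf that exceeds its capacity splits into four equal quadrants and redistributes its points.

```python
CAPACITY = 1  # max points per leaf before it subdivides

class QuadTree:
    def __init__(self, x, y, size):
        self.x, self.y, self.size = x, y, size  # lower-left corner, side length
        self.points = []
        self.children = None  # None => this node is a leaf

    def insert(self, px, py):
        if self.children is None:
            self.points.append((px, py))
            if len(self.points) > CAPACITY and self.size > 1:
                self._subdivide()
            return
        self._child_for(px, py).insert(px, py)

    def _subdivide(self):
        h = self.size / 2
        self.children = [QuadTree(self.x + dx, self.y + dy, h)
                         for dx in (0, h) for dy in (0, h)]
        pts, self.points = self.points, []
        for px, py in pts:
            self._child_for(px, py).insert(px, py)

    def _child_for(self, px, py):
        h = self.size / 2
        i = (2 if px >= self.x + h else 0) + (1 if py >= self.y + h else 0)
        return self.children[i]

    def query(self, px, py):
        # Return the points stored in the leaf whose region covers (px, py).
        node = self
        while node.children is not None:
            node = node._child_for(px, py)
        return node.points
```

A point lookup only descends the quadrants containing the query location, which is what makes quadtrees effective for spatial data.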
Upvotes: 8
Reputation: 2109
I stumbled on another data structure, the Cartesian tree, when I read about some algorithms related to RMQ and LCA. In a Cartesian tree, the lowest common ancestor of two nodes is the minimum element between them in the original sequence. This makes it useful for converting an RMQ problem into an LCA problem.
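A Cartesian tree can be built in linear time with a stack holding the rightmost spine; this sketch (in Python, with illustrative names) builds a min-ordered one, so an in-order traversal recovers the original sequence and the root is the minimum.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

def build_cartesian(values):
    """O(n) stack-based construction of a min-Cartesian tree."""
    stack = []  # nodes on the current rightmost spine
    for v in values:
        node = Node(v)
        last = None
        # Pop spine nodes larger than v; they become v's left subtree.
        while stack and stack[-1].value > v:
            last = stack.pop()
        node.left = last
        if stack:
            stack[-1].right = node
        stack.append(node)
    return stack[0]  # bottom of the stack is the root
```

Each node is pushed and popped at most once, hence the O(n) bound.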
Upvotes: 0
Reputation: 400116
Fibonacci heaps are used in some of the asymptotically fastest known algorithms for a lot of graph-related problems, such as the Shortest Path problem. Dijkstra's algorithm runs in O(E log V) time with standard binary heaps; using Fibonacci heaps improves that to O(E + V log V), which is a huge speedup for dense graphs. Unfortunately, they have a high constant factor, which often makes them impractical.
Upvotes: 52
Reputation: 69924
Disjoint Set Forests allow fast membership queries and union operations and are most famously used in Kruskal's Algorithm for minimum spanning trees.
The really cool thing is that both operations have amortized running time proportional to the inverse of the Ackermann Function, making this the "fastest" non-constant time data structure.
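A compact sketch of a disjoint-set forest in Python (names are illustrative), using the two standard tricks that give the near-constant amortized bounds: union by rank and path compression (here in its path-halving form).

```python
class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        # Path halving: point every other node at its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False  # already in the same set
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra  # attach the shallower tree under the deeper
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True
```

In Kruskal's algorithm, `union` returning `False` is exactly the signal that an edge would create a cycle and should be skipped.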
Upvotes: 2
Reputation: 3862
Zobrist Hashing is a hash function generally used for representing a game board position (like in Chess) but it surely has other uses. One nice thing about it is that it can be incrementally updated as the board is updated.
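A small Python sketch of the scheme (board encoding and constants are illustrative): each (square, piece) pair gets a random bitstring, the board hash is the XOR of the occupied pairs, and a move updates the hash with two XORs instead of rehashing the whole board.

```python
import random

random.seed(42)  # fixed seed just so the example is reproducible
NUM_SQUARES, NUM_PIECES = 64, 12
TABLE = [[random.getrandbits(64) for _ in range(NUM_PIECES)]
         for _ in range(NUM_SQUARES)]

def full_hash(board):
    """board: dict mapping square index -> piece index."""
    h = 0
    for square, piece in board.items():
        h ^= TABLE[square][piece]
    return h

def move_hash(h, piece, from_sq, to_sq):
    # Incremental O(1) update for a non-capturing move:
    # XOR the piece out of its old square and into the new one.
    return h ^ TABLE[from_sq][piece] ^ TABLE[to_sq][piece]
```

The incremental update is what makes Zobrist hashes cheap enough to maintain in a game tree search's transposition table.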
Upvotes: 5
Reputation: 15
I think Cycle Sort is a pretty neat sorting algorithm.
It's a sorting algorithm that minimizes the total number of writes. This is particularly useful when you're dealing with flash memory, where the lifespan of the device is limited by the number of writes it can sustain. Here is the Wikipedia article, but I recommend going to the first link. (nice visuals!)
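A Python sketch of cycle sort that also counts the writes it performs; each element is written at most once, into its final position, which is the property that matters for wear-limited media.

```python
def cycle_sort(a):
    """Sort `a` in place; return the number of writes to the array."""
    writes = 0
    for start in range(len(a) - 1):
        item = a[start]
        # Count smaller elements to find item's final position.
        pos = start
        for i in range(start + 1, len(a)):
            if a[i] < item:
                pos += 1
        if pos == start:
            continue  # already in its final place: zero writes
        while item == a[pos]:
            pos += 1  # skip over duplicates
        a[pos], item = item, a[pos]
        writes += 1
        # Rotate the rest of the cycle until we close it at `start`.
        while pos != start:
            pos = start
            for i in range(start + 1, len(a)):
                if a[i] < item:
                    pos += 1
            while item == a[pos]:
                pos += 1
            a[pos], item = item, a[pos]
            writes += 1
    return writes
```

Note the trade-off: the scan to locate each element's slot makes it O(n^2) in comparisons, which is the price paid for the minimal write count.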
Upvotes: 1
Reputation: 6189
A queue implemented using 2 stacks is pretty space efficient (as opposed to using a linked list, which has at least one extra pointer/reference of overhead per entry).
How to implement a queue using two stacks?
This has worked well for me when the queues are huge. If I save 8 bytes on a pointer, it means that queues with a million entries save about 8MB of RAM.
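The technique in a few lines of Python (using plain lists as the two stacks): enqueues push onto an inbox, and dequeues pop from an outbox that is refilled by reversing the inbox only when it runs dry.

```python
class TwoStackQueue:
    def __init__(self):
        self.inbox = []   # receives enqueued items
        self.outbox = []  # holds items in dequeue order

    def enqueue(self, item):
        self.inbox.append(item)

    def dequeue(self):
        if not self.outbox:
            # Reverse the inbox into the outbox. Each element crosses
            # over at most once, so dequeue is amortized O(1).
            while self.inbox:
                self.outbox.append(self.inbox.pop())
        return self.outbox.pop()
```

Since both stacks are contiguous arrays, there are no per-node link pointers, which is the space saving the answer describes.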
Upvotes: 3
Reputation: 44093
I think the FM-index by Paolo Ferragina and Giovanni Manzini is really cool, especially in bioinformatics. It's essentially a compressed full-text index built from a combination of a suffix array and the Burrows–Wheeler transform of the reference text. The index can be searched without decompressing the whole index.
Upvotes: 3
Reputation: 48687
Right-angle triangle networks (RTINs)
Beautifully simple way to adaptively subdivide a mesh. Split and merge operations are just a few lines of code each.
Upvotes: 0
Reputation: 46369
I like cache-oblivious data structures. The basic idea is to lay out a tree in recursively smaller blocks so that caches of many different sizes will take advantage of blocks that conveniently fit in them. This leads to efficient use of caching at every level, from the L1 cache to RAM to big chunks of data read off the disk, without needing to know the specifics of the sizes of any of those caching layers.
Upvotes: 26
Reputation: 39208
I am not sure if this data structure has a name, but the tokenmap data structure proposed for inclusion into Boost is kind of interesting. It is a dynamically resizable map where look-ups are not only O(1), they are simple array accesses. I wrote most of the background material on this data structure, which describes the fundamental principle behind how it works.
Something like a tokenmap is used by operating systems to map file or resource handles to data structures representing the file or resource.
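A hypothetical sketch of the idea in Python (this is my own slot-map-style reconstruction, not the actual Boost proposal): a token encodes a slot index plus a generation counter, so look-up is a single array access followed by a staleness check, and removed slots are recycled through a free list.

```python
class TokenMap:
    def __init__(self):
        self.slots = []  # one (generation, value) pair per slot
        self.free = []   # indices of recycled slots

    def insert(self, value):
        if self.free:
            idx = self.free.pop()
            gen, _ = self.slots[idx]
            self.slots[idx] = (gen, value)
        else:
            idx = len(self.slots)
            self.slots.append((0, value))
        return (idx, self.slots[idx][0])  # the token handed to the caller

    def get(self, token):
        idx, gen = token
        cur_gen, value = self.slots[idx]
        # A stale token's generation no longer matches the slot's.
        return value if cur_gen == gen else None

    def remove(self, token):
        idx, gen = token
        cur_gen, _ = self.slots[idx]
        if cur_gen == gen:
            self.slots[idx] = (cur_gen + 1, None)  # invalidate old tokens
            self.free.append(idx)
```

The generation counter is what lets an OS-style handle table detect use of a handle that was closed and reissued.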
Upvotes: 2
Reputation: 1
Half edge data structure and winged edge for polygonal meshes.
Useful for computational geometry algorithms.
Upvotes: 1
Reputation: 643
An XOR linked list stores the XOR of the previous and next node addresses in a single pointer field, halving the link storage of a doubly-linked list. Kind of obscure but neat!
Upvotes: 10
Reputation: 33409
DAWGs are a special kind of Trie where similar child trees are compressed into single parents. I extended DAWGs and came up with a nifty data structure called ASSDAWG (Anagram Search Sorted DAWG). The way it works is that whenever a string is inserted into the DAWG, it is bucket-sorted first and then inserted, and the leaf nodes hold an additional number indicating which permutations are valid if we reach that leaf node from the root. This has a couple of nifty advantages.
Upvotes: 6
Reputation: 7604
I really, really love interval trees. They allow you to take a bunch of intervals (i.e. start/end times, or whatever) and query for which intervals contain a given time, or which intervals were "active" during a given period. Querying can be done in O(log n) and pre-processing is O(n log n).
Upvotes: 10
Reputation: 2613
PATRICIA - Practical Algorithm to Retrieve Information Coded in Alphanumeric, D.R.Morrison (1968).
A PATRICIA tree is related to a Trie. The problem with Tries is that when the set of keys is sparse, i.e. when the actual keys form a small subset of the set of potential keys, as is very often the case, many (most) of the internal nodes in the Trie have only one descendant. This causes the Trie to have a high space-complexity.
http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/PATRICIA/
Upvotes: 2
Reputation: 23
Burrows–Wheeler transform (block-sorting compression)
It's an essential algorithm for compression. Say you want to compress the lines of a text file. You might object that sorting the lines loses information, but the BWT works like this: it greatly reduces entropy by sorting its input while keeping integer indexes that recover the original order.
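A naive Python sketch of the transform and its inverse (real implementations use suffix arrays instead of materializing every rotation): sort all rotations of the input, keep the last column, and recover the original by repeatedly re-sorting prepended columns.

```python
def bwt(s, eos="\0"):
    """Naive BWT: O(n^2 log n); fine for illustration only."""
    s += eos  # unique end-of-string sentinel marks the original rotation
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def inverse_bwt(last, eos="\0"):
    # Rebuild the sorted rotation table one column at a time.
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    # The row ending in the sentinel is the original string.
    for row in table:
        if row.endswith(eos):
            return row[:-1]
```

The point of the transform is visible in its output: equal characters cluster together, which downstream coders (move-to-front, run-length, entropy coding) exploit, exactly as in bzip2.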
Upvotes: 2
Reputation: 74420
A corner-stitched data structure. From the summary:
Corner stitching is a technique for representing rectangular two-dimensional objects. It appears to be especially well-suited for interactive editing systems for VLSI layouts. The data structure has two important features: first, empty space is represented explicitly; and second, rectangular areas are stitched together at their corners like a patchwork quilt. This organization results in fast algorithms (linear time or better) for searching, creation, deletion, stretching, and compaction. The algorithms are presented under a simplified model of VLSI circuits, and the storage requirements of the structure are discussed. Measurements indicate that corner stitching requires approximately three times as much memory space as the simplest possible representation.
Upvotes: 2
Reputation:
Per the Bloom Filter mentions, Deletable Bloom Filters (DlBF) are in some ways better than basic counting variants. See http://arxiv.org/abs/1005.0352
Upvotes: 4
Reputation: 1325
Fenwick trees (or binary indexed trees) are a worthy addition to one's toolkit. If you have an array of counters and you need to constantly update them while querying for cumulative counts (as in PPM compression), Fenwick trees will do all operations in O(log n) time and require no extra space. See also this topcoder tutorial for a good introduction.
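A minimal Python sketch: the classic trick is that `i & (-i)` isolates the lowest set bit, which determines how the implicit tree is climbed for both updates and prefix-sum queries.

```python
class Fenwick:
    """Array of counters with O(log n) point update and prefix sum."""

    def __init__(self, n):
        self.tree = [0] * (n + 1)  # internally 1-based

    def add(self, i, delta):
        """Add delta to counter i (0-based)."""
        i += 1
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & (-i)   # climb to the next node responsible for i

    def prefix_sum(self, i):
        """Sum of counters 0..i inclusive."""
        i += 1
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & (-i)   # drop the lowest set bit
        return total
```

The whole structure is a single flat array the same size as the counters, which is what "no extra space" refers to.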
Upvotes: 5
Reputation: 138307
A min-max heap is a variation of a heap that implements a double-ended priority queue. It achieves this with a simple change to the heap property: a tree is min-max ordered if every element on an even (odd) level is less (greater) than all of its children and grandchildren, with the levels numbered starting from 1.
Upvotes: 27
Reputation: 138307
An unrolled linked list is a variation on the linked list which stores multiple elements in each node. It can drastically increase cache performance, while decreasing the memory overhead associated with storing list metadata such as references. It is related to the B-tree.
record node {
node next // reference to next node in list
int numElements // number of elements in this node, up to maxElements
array elements // an array of numElements elements, with space allocated for maxElements elements
}
Upvotes: 12
Reputation: 11985
Arne Andersson trees are a simpler alternative to red-black trees, in which only right links can be red. This greatly simplifies maintenance, while keeping performance on par with red-black trees. The original paper gives a nice and short implementation for insertion and deletion.
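The simplicity shows in the two rebalancing primitives from Andersson's paper, `skew` and `split`, each just a handful of lines; here is a Python sketch of the two operations on their own (a full insert/delete routine is omitted):

```python
class Node:
    def __init__(self, value, level=1, left=None, right=None):
        self.value, self.level = value, level
        self.left, self.right = left, right

def skew(node):
    # Remove a left horizontal link (left child at the same level)
    # by rotating right.
    if node.left and node.left.level == node.level:
        left = node.left
        node.left = left.right
        left.right = node
        return left
    return node

def split(node):
    # Remove two consecutive right horizontal links by rotating left
    # and promoting the middle node one level.
    if (node.right and node.right.right
            and node.right.right.level == node.level):
        right = node.right
        node.right = right.left
        right.left = node
        right.level += 1
        return right
    return node
```

Insertion is then just a BST insert followed by `skew` and `split` on the way back up, which is the maintenance simplification the answer mentions.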
Upvotes: 6
Reputation: 11985
Have a look at Finger Trees, especially if you're a fan of the previously mentioned purely functional data structures. They're a functional representation of persistent sequences supporting access to the ends in amortized constant time, and concatenation and splitting in time logarithmic in the size of the smaller piece.
As per the original article:
Our functional 2-3 finger trees are an instance of a general design technique introduced by Okasaki (1998), called implicit recursive slowdown. We have already noted that these trees are an extension of his implicit deque structure, replacing pairs with 2-3 nodes to provide the flexibility required for efficient concatenation and splitting.
A Finger Tree can be parameterized with a monoid, and using different monoids will result in different behaviors for the tree. This lets Finger Trees simulate other data structures.
Upvotes: 38
Reputation: 10208
Tries, also known as prefix trees (crit-bit and PATRICIA trees are compressed variants), have existed for over 40 years but are still relatively unknown. A very cool use of tries is described in "TRASH - A dynamic LC-trie and hash data structure", which combines a trie with a hash function.
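A tiny Python sketch using nested dicts as nodes (the end-of-word marker is an illustrative convention): lookups cost O(length of the key) regardless of how many keys are stored, and prefix queries come for free.

```python
class Trie:
    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker

    def contains(self, word):
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node

    def has_prefix(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node:
                return False
            node = node[ch]
        return True
```

The shared-prefix structure is also why tries back autocomplete: all completions of a prefix live in one subtree.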
Upvotes: 270
Reputation: 93410
It's a variety of B-tree that is efficient for searching at the cost of a more expensive insertion.
Upvotes: 4
Reputation: 28
Work Stealing Queue
A lock-free data structure for dividing work equally among multiple threads. See: Implementation of a work stealing queue in C/C++?
Upvotes: 22
Reputation: 12832
I sometimes use inversion lists to store ranges, and they are often used to store character classes in regular expressions. See for example http://www.ibm.com/developerworks/linux/library/l-cpinv.html
Another nice use case is weighted random decisions. Suppose you have a list of symbols and associated probabilities, and you want to pick one at random according to these probabilities:
a => 0.1
b => 0.5
c => 0.4
Then you do a running sum of all the probabilities:
(0.1, 0.6, 1.0)
This is your inversion list. You generate a random number between 0 and 1, and find the index of the next higher entry in the list. You can do that with a binary search, because it's sorted. Once you've got the index, you can look up the symbol in the original list.
If you have n symbols, you have O(n) preparation time and then O(log(n)) access time for each randomly chosen symbol, independently of the distribution of weights.
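The weighted-choice recipe above can be sketched in a few lines of Python with the standard `bisect` module (the symbol/weight values are the example's):

```python
import bisect
import random

symbols = ["a", "b", "c"]
weights = [0.1, 0.5, 0.4]

# Build the running-sum ("inversion") list once: O(n).
cumulative = []
total = 0.0
for w in weights:
    total += w
    cumulative.append(total)

def pick(rng=random):
    """Choose a symbol with probability proportional to its weight."""
    r = rng.random() * total
    # Binary search for the first entry greater than r: O(log n).
    return symbols[bisect.bisect_right(cumulative, r)]
```

For instance, any draw of r below 0.1 lands on "a", between 0.1 and 0.6 on "b", and above 0.6 on "c", matching the stated probabilities.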
A variation of inversion lists uses negative numbers to indicate the endpoint of ranges, which makes it easy to count how many ranges overlap at a certain point. See http://www.perlmonks.org/index.pl?node_id=841368 for an example.
Upvotes: 6