f3lix
f3lix

Reputation: 29879

What are the lesser known but useful data structures?

There are some data structures around that are really useful but are unknown to most programmers. Which ones are they?

Everybody knows about linked lists, binary trees, and hashes, but what about Skip lists and Bloom filters for example. I would like to know more data structures that are not so common, but are worth knowing because they rely on great ideas and enrich a programmer's tool box.

PS: I am also interested in techniques like Dancing links which make clever use of properties of a common data structure.

EDIT: Please try to include links to pages describing the data structures in more detail. Also, try to add a couple of words on why a data structure is cool (as Jonas Kölker already pointed out). Also, try to provide one data-structure per answer. This will allow the better data structures to float to the top based on their votes alone.

Upvotes: 792

Views: 391071

Answers (30)

user97214
user97214

Reputation:

Getting away from all these graph structures, I just love the simple Ring-Buffer.

When properly implemented you can seriously reduce your memory footprint while maintaining performance and sometimes even improving it.

Upvotes: 4

Andrew Whitaker
Andrew Whitaker

Reputation: 126042

The Region Quadtree

(quoted from Wikipedia)

The region quadtree represents a partition of space in two dimensions by decomposing the region into four equal quadrants, subquadrants, and so on with each leaf node containing data corresponding to a specific subregion. Each node in the tree either has exactly four children, or has no children (a leaf node).

Quadtrees like this are good for storing spatial data, e.g. latitude and longitude or other types of coordinates.

This was by far my favorite data structure in college. Coding this guy and seeing it work was pretty cool. I highly recommend it if you're looking for a project that will take some thought and is a little off the beaten path. Anyway, it's a lot more fun than the standard BST derivatives that you're usually assigned in your data structures class!

In fact, as a bonus, I've found the notes from the lecture leading up to the class project (from Virginia Tech) here (pdf warning).

Upvotes: 8

jscoot
jscoot

Reputation: 2109

I stumbled on another data structure Cartesian Tree when i read about some algorithms related to RMQ and LCA. In a cartesian tree, the lowest common ancestor between two nodes is the minimum node between them. It is useful to convert a RMQ problem to LCA.

Upvotes: 0

Adam Rosenfield
Adam Rosenfield

Reputation: 400116

Fibonacci heaps

They're used in some of the fastest known algorithms (asymptotically) for a lot of graph-related problems, such as the Shortest Path problem. Dijkstra's algorithm runs in O(E log V) time with standard binary heaps; using Fibonacci heaps improves that to O(E + V log V), which is a huge speedup for dense graphs. Unfortunately, though, they have a high constant factor, often making them impractical in practice.

Upvotes: 52

hugomg
hugomg

Reputation: 69924

Disjoint Set Forests allow fast membership queries and union operations and are most famously used in Kruskal's Algorithm for minimum spanning trees.

The really cool thing is that both operations have amortized running time proportional to the inverse of the Ackermann Function, making this the "fastest" non-constant time data structure.

Upvotes: 2

Fantius
Fantius

Reputation: 3862

Zobrist Hashing is a hash function generally used for representing a game board position (like in Chess) but surely has other uses. One nice things about it is that is can be incrementally updated as the board is updated.

Upvotes: 5

jyt
jyt

Reputation: 15

I think Cycle Sort is a pretty neat sorting algorithm.

It's a sorting algorithm used to minimize the total number of writes. This is particularly useful when you're dealing with flash memory where the life-span of the flash memory is proportional to the amount of writes. Here is the Wikipedia article, but I recommend going to the first link. (nice visuals!)

Upvotes: 1

dhruvbird
dhruvbird

Reputation: 6189

A queue implemented using 2 stacks is pretty space efficient (as opposed to using a linked list which will have at least a 1 extra pointer/reference overhead).

How to implement a queue using two stacks?

This has worked well for me when the queues are huge. If I save 8 bytes on a pointer, it means that queues with a million entries save about 8MB of RAM.

Upvotes: 3

st0le
st0le

Reputation: 33545

Ternary Search Tree

  • Quick prefix search (for incremental autocomplete,etc)
  • Partial Matching (When you want to find all words within X hamming distance of a string)
  • Wildcard Searches

Quite Easy to implement.

Upvotes: 3

GWW
GWW

Reputation: 44093

I think the FM-index by Paolo Ferragina and Giovanni Manzini is really cool. Especially in bioinformatics. It's essentially a compressed full text index that utilizes a combination of a suffix array and a burrows-wheeler transform of the reference text. The index can be searched without decompressing the whole index.

Upvotes: 3

J D
J D

Reputation: 48687

Right-angle triangle networks (RTINs)

Beautifully simple way to adaptively subdivide a mesh. Split and merge operations are just a few lines of code each.

Upvotes: 0

btilly
btilly

Reputation: 46369

I like Cache Oblivious datastructures. The basic idea is to lay out a tree in recursively smaller blocks so that caches of many different sizes will take advantage of blocks that convenient fit in them. This leads to efficient use of caching at everything from L1 cache in RAM to big chunks of data read off of the disk without needing to know the specifics of the sizes of any of those caching layers.

Upvotes: 26

Daniel Trebbien
Daniel Trebbien

Reputation: 39208

I am not sure if this data structure has a name, but the proposed tokenmap data structure for inclusion into Boost is kind of interesting. It is a dynamically resizable map where look-ups are not only O(1), they are simple array accesses. I wrote most of the background material on this data structure which describes the fundamental principle behind how it works.

Something like a tokenmap is used by operating systems to map file or resource handles to data structures representing the file or resource.

Upvotes: 2

habeanf
habeanf

Reputation: 1

Half edge data structure and winged edge for polygonal meshes.

Useful for computational geometry algorithms.

Upvotes: 1

yonkeltron
yonkeltron

Reputation: 643

XOR Linked List uses two XOR'd pointers to lessen the storage requirements for doubly-linked list. Kind of obscure but neat!

Upvotes: 10

pathikrit
pathikrit

Reputation: 33409

DAWGs are a special kind of Trie where similar child trees are compressed into single parents. I extended modified DAWGs and came up with a nifty data structure called ASSDAWG (Anagram Search Sorted DAWG). The way this works is whenever a string is inserted into the DAWG, it is bucket-sorted first and then inserted and the leaf nodes hold an additional number indicating which permutations are valid if we reach that leaf node from root. This has 2 nifty advantages:

  1. Since I sort the strings before insertion and since DAWGs naturally collapse similar sub trees, I get high level of compression (e.g. "eat", "ate", "tea" all become 1 path a-e-t with a list of numbers at the leaf node indicating which permutations of a-e-t are valid).
  2. Searching for anagrams of a given string is super fast and trivial now as a path from root to leaf holds all the valid anagrams of that path at the leaf node using permutation-numbers.

Upvotes: 6

Jonathan
Jonathan

Reputation: 7604

I really really love Interval Trees. They allow you to take a bunch of intervals (ie start/end times, or whatever) and query for which intervals contain a given time, or which intervals were "active" during a given period. Querying can be done in O(log n) and pre-processing is O(n log n).

Upvotes: 10

juancn
juancn

Reputation: 2613

PATRICIA - Practical Algorithm to Retrieve Information Coded in Alphanumeric, D.R.Morrison (1968).

A PATRICIA tree is related to a Trie. The problem with Tries is that when the set of keys is sparse, i.e. when the actual keys form a small subset of the set of potential keys, as is very often the case, many (most) of the internal nodes in the Trie have only one descendant. This causes the Trie to have a high space-complexity.

http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/PATRICIA/

Upvotes: 2

Lukáš Nalezenec
Lukáš Nalezenec

Reputation: 23

Burrows–Wheeler transform (block-sorting compression)

Its essential algorithm for compression. Let say that you want to compress lines on text files. You would say that if you sort the lines, you lost information. But BWT works like this - it reduces entropy a lot by sorting input, keeping integer indexes to recover the original order.

Upvotes: 2

Trey Jackson
Trey Jackson

Reputation: 74420

A corner-stitched data structure. From the summary:

Corner stitching is a technique for representing rectangular two-dimensional objects. It appears to be especially well-suited for interactive editing systems for VLSI layouts. The data structure has two important features: first, empty space is represented explicitly; and second, rectangular areas are stitched together at their corners like a patchwork quilt. This organization results in fast algorithms (linear time or better) for searching, creation, deletion, stretching, and compaction. The algorithms are presented under a simplified model of VLSI circuits, and the storage requirements of the structure are discussed. Measurements indicate that corner stitching requires approximately three times as much memory space as the simplest possible representation.

Upvotes: 2

user201295
user201295

Reputation:

Per the Bloom Filter mentions, Deletable Bloom Filters (DlBF) are in some ways better than basic counting variants. See http://arxiv.org/abs/1005.0352

Upvotes: 4

Sriram Srinivasan
Sriram Srinivasan

Reputation: 1325

Fenwick trees (or binary indexed trees) are a worthy addition to ones toolkit. If you have an array of counters and you need to constantly update them while querying for cumulative counts (as in PPM compression), Fenwick trees will do all operations in O(log n) time and require no extra space. See also this topcoder tutorial for a good introduction.

Upvotes: 5

moinudin
moinudin

Reputation: 138307

A min-max heap is a variation of a heap that implements a double-ended priority queue. It achieves this by by a simple change to the heap property: A tree is said to be min-max ordered if every element on even (odd) levels are less (greater) than all childrens and grand children. The levels are numbered starting from 1.

http://internet512.chonbuk.ac.kr/datastructure/heap/img/heap8.jpg

Upvotes: 27

moinudin
moinudin

Reputation: 138307

An unrolled linked list is a variation on the linked list which stores multiple elements in each node. It can drastically increase cache performance, while decreasing the memory overhead associated with storing list metadata such as references. It is related to the B-tree.

record node {
    node next       // reference to next node in list
    int numElements // number of elements in this node, up to maxElements
    array elements  // an array of numElements elements, with space allocated for maxElements elements
}

Upvotes: 12

Francois G
Francois G

Reputation: 11985

Arne Andersson trees are a simpler alternative to red-black trees, in which only right links can be red. This greatly simplifies maintenance, while keeping performance on par with red-black trees. The original paper gives a nice and short implementation for insertion and deletion.

Upvotes: 6

Francois G
Francois G

Reputation: 11985

Have a look at Finger Trees, especially if you're a fan of the previously mentioned purely functional data structures. They're a functional representation of persistent sequences supporting access to the ends in amortized constant time, and concatenation and splitting in time logarithmic in the size of the smaller piece.

As per the original article:

Our functional 2-3 finger trees are an instance of a general design technique in- troduced by Okasaki (1998), called implicit recursive slowdown. We have already noted that these trees are an extension of his implicit deque structure, replacing pairs with 2-3 nodes to provide the flexibility required for efficient concatenation and splitting.

A Finger Tree can be parameterized with a monoid, and using different monoids will result in different behaviors for the tree. This lets Finger Trees simulate other data structures.

Upvotes: 38

David Phillips
David Phillips

Reputation: 10208

Tries, also known as prefix-trees or crit-bit trees, have existed for over 40 years but are still relatively unknown. A very cool use of tries is described in "TRASH - A dynamic LC-trie and hash data structure", which combines a trie with a hash function.

Upvotes: 270

karlphillip
karlphillip

Reputation: 93410

B* tree

It's a variety of B-tree that is efficient for searching at the cost of a more expensive insertion.

Upvotes: 4

Marko Tintor
Marko Tintor

Reputation: 28

Work Stealing Queue

Lock-free data structure for dividing the work equaly among multiple threads Implementation of a work stealing queue in C/C++?

Upvotes: 22

moritz
moritz

Reputation: 12832

I sometimes use Inversion LIsts to store ranges, and they are often used to store character classes in regular expressions. See for example http://www.ibm.com/developerworks/linux/library/l-cpinv.html

Another nice use case is for weighted random decisions. Suppose you have a list of symbols and associated probabilites, and you want to pick them at random according to these probabilities

   a => 0.1
   b => 0.5
   c => 0.4

Then you do a running sum of all the probabilities:

  (0.1, 0.6, 1.0)

This is your inversion list. You generate a random number between 0 and 1, and find the index of the next higher entry in the list. You can do that with a binary search, because it's sorted. Once you've got the index, you can look up the symbol in the original list.

If you have n symbols, you have O(n) preparation time, and then O(log(n)) acess time for each randomly chosen symbol - independently of the distribution of weights.

A variation of inversion lists uses negative numbers to indicate the endpoint of ranges, which makes it easy to count how many ranges overlap at a certain point. See http://www.perlmonks.org/index.pl?node_id=841368 for an example.

Upvotes: 6

Related Questions