newmanne

Reputation: 2089

Sorted data structure with O(logN) insertion that gives insertion point index

My goal is a sorted data structure that can accomplish 2 things:

  1. Fast insertion (at the location according to sort order)
  2. I can quickly segment my data into the sets of everything greater than or less than or equal to an element. I need to know the size of each of these partitions, and I need to be able to "get" these partitions.

Currently, I'm implementing this in Java using an ArrayList, which provides #2 very easily: I can perform a binary search (Collections.binarySearch) and get an insertion index telling me at what point an element would be inserted. Then, since indices range from 0 to the size of the array, I immediately know how many elements are greater or smaller than my element, and I can easily get at those elements (as a sublist). However, this doesn't have property #1 and results in too much array copying.
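For reference, the ArrayList approach described above can be sketched like this (class and method names are mine; the key fact is that Collections.binarySearch encodes a missing key's insertion point as -(insertionPoint) - 1):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BinarySearchPartition {
    // Returns the index at which key would be inserted to keep the list sorted.
    static int insertionPoint(List<Integer> sorted, int key) {
        int idx = Collections.binarySearch(sorted, key);
        // A negative result encodes the insertion point as -(insertionPoint) - 1.
        return idx >= 0 ? idx : -idx - 1;
    }

    public static void main(String[] args) {
        List<Integer> sorted = new ArrayList<>(List.of(1, 3, 5, 7, 9));
        int p = insertionPoint(sorted, 6);           // 3
        int smaller = p;                             // number of elements < 6
        int greaterOrEqual = sorted.size() - p;      // number of elements >= 6
        List<Integer> below = sorted.subList(0, p);  // O(1) view, no copying
        System.out.println(p + " " + smaller + " " + greaterOrEqual + " " + below);
    }
}
```

The partition sizes and the partitions themselves fall out of a single O(log N) search; the cost is entirely in the O(N) shifting that each insertion into the backing array incurs.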

This makes me want to use something like a SkipList or RedBlackTree that could perform the insertions faster, but then I can't figure out how to satisfy property #2 without making it take O(N) time.

Any suggestions would be appreciated. Thanks

EDIT: Thanks for the answers below that reference data structures that perform the insertion in O(logN) time and that can partition quickly as well, but I want to highlight the size() requirement - I need to know the size of these partitions without having to traverse the entire partition (which, according to this, is what the TreeSet does). The reasoning behind this is that in my use case I maintain my data in several copies of the data structure, each using a different comparator, and then need to ask "according to which comparator is the set of all things larger than a particular element smallest?". In the ArrayList case this is actually easy and takes only O(YlogN), where Y is the number of comparators, because I just binary search each of the Y arrays and return the ArrayList with the highest insertion index. It's unclear to me how I could do this with a TreeSet without taking O(YN).

I should also add that an approximate answer for the insertion index would still be valuable even if it couldn't be solved exactly.

Upvotes: 2

Views: 4035

Answers (4)

Alex Frechette

Reputation: 211

One easy way to get what you want involves augmenting your favourite binary search tree data structure (red-black trees, AVL trees, etc...) with left and right subtree sizes at each node --- call them L-size and R-size.

Assume that updating these fields in your tree data structures can be done efficiently (say constant time). Then here is what you get:

  • Insertion, deletion, and all the regular binary search tree operations as efficient as your choice of data structure --- O(log n) for red-black trees.
  • Given a key x, you can get the number of elements in your tree that have keys less than x in O(log n) time, by descending down the tree to find the appropriate location for x, summing up the L-sizes (plus one for the actual node you're traversing) each time you "go right". The "greater than" case is symmetrical.
  • Given a key x, you can get the sorted list x_L of elements that are less than x in O(log n + |x_L|) time by, again, descending down the tree to find the appropriate location for x, and each time you go right you tag the node you just traversed, appending it to a list h_L. Then doing in-order traversals of each of the nodes in h_L (in order of addition to h_L) will give you x_L (sorted). The "greater than" case is symmetrical.

Finally, for my answer to work, I need to guarantee you that we can maintain these L- and R-sizes efficiently for your choice of specific tree data structure. I'll consider the case of red-black trees.

Note that maintaining L-sizes and R-sizes is done in constant time for vanilla binary search trees (when you add a node starting from the root, just add one to L-sizes if the node should go in the left subtree, or one to the R-sizes if it goes in the right subtree). Now the additional balancing procedures of red-black trees only alter the tree structure through local rotations of nodes --- see Wikipedia's depiction of rotations in red-black trees. It's easy to see that the post-rotation L-size and R-size of P and Q can be recalculated from the L-sizes and R-sizes of A,B,C. This only adds a constant amount of work to the red-black tree's operations.
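A minimal sketch of the counting idea, using a plain (unbalanced) BST for clarity. The class and method names are illustrative; a production version would layer the same size field onto a red-black or AVL tree and recompute it during rotations, as described above:

```java
public class RankTree {
    private static final class Node {
        final int key;
        int size = 1;            // number of nodes in this subtree
        Node left, right;
        Node(int key) { this.key = key; }
    }

    private Node root;

    public void insert(int key) { root = insert(root, key); }

    private static Node insert(Node n, int key) {
        if (n == null) return new Node(key);
        if (key < n.key) n.left = insert(n.left, key);
        else n.right = insert(n.right, key);
        n.size = 1 + size(n.left) + size(n.right);
        return n;
    }

    private static int size(Node n) { return n == null ? 0 : n.size; }

    // Number of stored keys strictly less than key, in O(height): each time we
    // "go right", the whole left subtree plus the current node are smaller.
    public int countLess(int key) {
        int count = 0;
        for (Node n = root; n != null; ) {
            if (key <= n.key) {
                n = n.left;
            } else {
                count += size(n.left) + 1;
                n = n.right;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        RankTree t = new RankTree();
        for (int k : new int[]{5, 3, 8, 1, 4}) t.insert(k);
        System.out.println(t.countLess(5)); // keys 1, 3, 4 are smaller
    }
}
```

Note that countLess(x) is exactly the insertion index the question asks for, obtained without touching the partition itself.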

Upvotes: 1

fps

Reputation: 34460

Use a plain Java TreeSet. Insertion takes O(logN), so #1 of your requirements is done. Here's a quote from the documentation:

This implementation provides guaranteed log(n) time cost for the basic operations (add, remove and contains).

And as it implements the NavigableSet interface, you have #2 of your requirements with methods such as headSet, tailSet and subSet:

These operations are overloaded with versions that include/exclude the bounds provided.

TreeSet is quite flexible: it allows you to define a Comparator to order the Set in a custom way, or you can also rely on the natural ordering of the elements.
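For example, with natural ordering (a custom Comparator passed to the constructor works the same way):

```java
import java.util.TreeSet;

public class TreeSetPartitions {
    public static void main(String[] args) {
        TreeSet<Integer> set = new TreeSet<>();
        for (int k : new int[]{1, 3, 5, 7, 9}) set.add(k);

        // Live views backed by the set, obtained in O(log n):
        System.out.println(set.headSet(6, false)); // everything < 6  -> [1, 3, 5]
        System.out.println(set.tailSet(6, true));  // everything >= 6 -> [7, 9]
    }
}
```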

EDIT:

As per the requirement that the returned subsets' size() operation not be O(n), I'm afraid there's no ready-made implementation in the Java API.

It's true that the set views returned by TreeSet's range operations implement size() by 'jumping' to the first element of the view in O(log n) time, and then iterating over the subsequent elements, adding 1 on each iteration, until the end of the subset is reached.

I must say this is quite unfortunate, since it's not always necessary to traverse the returned subset view; sometimes, knowing the size of the subset in advance can be quite useful (as is your use case).

So, in order to fulfil your requirement, you need another structure, or at least an auxiliary structure. After some research, I suggest you use a Fenwick tree. A Fenwick tree is also known as a Binary Indexed Tree (BIT), and can be either immutable or mutable. The immutable version is implemented with an array, while the mutable version could be implemented with a balanced binary tree, e.g. a red-black tree (Java's TreeSet is actually implemented as a red-black tree). Fenwick trees are mainly used to store frequencies and to calculate the sum of all frequencies up to a given element in O(log n) time.
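As an illustration, here is a minimal array-backed Fenwick tree (the class name is mine; this is the fixed-capacity array version, which assumes the keys can be mapped to a bounded integer range 1..n up front):

```java
public class FenwickTree {
    private final long[] tree; // 1-based implicit tree

    public FenwickTree(int n) { tree = new long[n + 1]; }

    // Add delta at position i, e.g. +1 when inserting a key that maps to index i.
    public void update(int i, long delta) {
        for (; i < tree.length; i += i & -i) tree[i] += delta;
    }

    // Sum of positions 1..i; with +1 updates, the number of keys at or below i.
    public long prefixSum(int i) {
        long sum = 0;
        for (; i > 0; i -= i & -i) sum += tree[i];
        return sum;
    }

    public static void main(String[] args) {
        FenwickTree counts = new FenwickTree(10);
        counts.update(3, 1);
        counts.update(5, 1);
        counts.update(7, 1);
        System.out.println(counts.prefixSum(6)); // two keys map at or below 6
    }
}
```

Both operations are O(log n), so interleaving insertions (update) with partition-size queries (prefixSum) stays cheap, as long as the key universe is known in advance.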

Please refer to this question here on Stack Overflow for a complete introduction to this little-known yet incredibly useful structure. (As the explanation is here on Stack Overflow, I won't copy it here.)

Here's another Stack Overflow question asking how to properly initialize a Fenwick tree, and here's actual Java code showing how to implement Fenwick tree's operations. Finally, here's a very good theoretic explanation about the structure and the underlying algorithms being used.

The problem with all the samples in the web is that they use the immutable version of the structure, which is not suitable to you, since you need to interleave queries with adding elements to the structure. However, they are all very useful to fully understand the structure and algorithms being used.

My suggestion is that you study Java TreeMap's implementation and see how to modify/extend it so that you can turn it into a Fenwick tree that keeps 1 as a value for every key. This 1 would be each key's frequency. So Fenwick tree's basic operation getSum(someElement) would actually return the size of the subset from first element up to someElement, in O(log n) time.

So the challenge is to implement a balanced tree (a descendant of Java's red-black TreeMap, actually) that implements all the Fenwick tree operations you need. I believe you'd be done with getSum(someElement), but maybe you could also extend the returned subtree range views so that they all defer to getSum(someElement) when implementing the size() operation for range views.

Hope this helps, at least I hope it's a good place to start. Please, let me know if you need clarifications, as well as examples.

Upvotes: 4

gknicker

Reputation: 5569

If you don't need duplicate elements (or if you can make the elements look distinct), I'd use a java.util.TreeSet. It meets your stated requirements.

  1. O(log n) sorted insertion due to binary tree structure
  2. O(log n) segmentation time using in-place subsets

Unfortunately, the O(log n) segmentation time is effectively slowed to O(n) by your requirement to always know the size of the segment, due to the reason in the answer you linked. The in-place subsets don't know their size until you ask them, and then they count. The counted size is stored, but if the backing set is changed in any way, the subset has to count again.

Upvotes: 2

CoronA

Reputation: 8075

I think the best data structure for this problem would be a B-Tree with a dense index. Such a B-Tree is built from:

  • inner nodes containing only pointers to child nodes
  • leaves containing pointers to paged arrays
  • a number of equal-sized arrays (pages)

Unfortunately, there are few generic implementations of a B-Tree in Java, probably because so many variations exist.

The cost of insertion would be

  • O(log(n)) to find the position
  • O(p) to insert a new element into a page (where p is the constant page size)

Maybe this data structure also covers your segmentation problem. If not, the cost of extracting would be

  • O(log(n)) to find the borders
  • O(e) to copy the extract (where e is the size of the extract)
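The paging idea can be sketched like so (a deliberately simplified, hypothetical structure, not a real B-Tree: the page lookup here is a linear scan and there is no inner-node index, but it shows why an insertion only shifts one page's worth of elements):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PagedList {
    private static final int PAGE_SIZE = 64;
    private final List<List<Integer>> pages = new ArrayList<>();

    public void insert(int key) {
        if (pages.isEmpty()) pages.add(new ArrayList<>());
        // Find the first page whose last element is >= key (linear for brevity;
        // a real B-Tree indexes the pages so this step is O(log n)).
        int at = pages.size() - 1;
        for (int i = 0; i < pages.size(); i++) {
            List<Integer> p = pages.get(i);
            if (!p.isEmpty() && p.get(p.size() - 1) >= key) { at = i; break; }
        }
        List<Integer> page = pages.get(at);
        int idx = Collections.binarySearch(page, key);
        page.add(idx >= 0 ? idx : -idx - 1, key); // shifts at most PAGE_SIZE elements
        if (page.size() > PAGE_SIZE) {            // split an overfull page in two
            List<Integer> upper = new ArrayList<>(page.subList(PAGE_SIZE / 2, page.size()));
            page.subList(PAGE_SIZE / 2, page.size()).clear();
            pages.add(at + 1, upper);
        }
    }

    // Flatten the pages back into one sorted list (for inspection).
    public List<Integer> toList() {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> p : pages) all.addAll(p);
        return all;
    }

    public static void main(String[] args) {
        PagedList pl = new PagedList();
        for (int k : new int[]{5, 1, 9, 3, 7}) pl.insert(k);
        System.out.println(pl.toList()); // sorted order across pages
    }
}
```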

Upvotes: 1
