Dolda2000

Reputation: 25855

Data structures for indexing on object class

I want to be able to keep a collection of objects and do lookups on it based on the type of the object, where the types can be hierarchical, like classes in a multiple inheritance OO system.

Right now I do this simply by keeping a list of the objects and looping through them, querying each to see if it is of the requested type, kind of like this, in Python-like pseudo-code:

def hastype(objects, type):
    for obj in objects:
        if isinstance(obj, type):
            return obj
    return None

Oftentimes this is not particularly a problem for me, but there are cases where it would be nice to be able to do these lookups more efficiently.

As mentioned, my types are very similar to classes in a multiple-inheritance system; each type declares any number of direct supertypes, and is given a complete list of direct and indirect supertypes from those. There is a type root. I can easily query the complete list of supertypes for a type. I also have global knowledge of all known types in the system, each of which has an integer ID, and the IDs are allocated contiguously, if that helps.
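
For concreteness, here is a minimal sketch of the type layout described above, in the same Python-like style; the class and attribute names are hypothetical:

# Hypothetical sketch: each type declares its direct supertypes, derives its
# complete supertype set from them, and gets a contiguous integer ID.
class Type:
    _next_id = 0

    def __init__(self, name, direct_supertypes=()):
        self.name = name
        self.id = Type._next_id          # contiguous integer ID
        Type._next_id += 1
        self.direct_supertypes = tuple(direct_supertypes)
        # complete set of direct and indirect supertypes
        self.all_supertypes = set(direct_supertypes)
        for sup in direct_supertypes:
            self.all_supertypes |= sup.all_supertypes

root = Type("root")
animal = Type("animal", [root])
mammal = Type("mammal", [animal])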

The main characteristic I care about is quick lookup regardless of how many objects are in the collection (it doesn't need to be O(1), but something better than O(n) would be nice). I also care quite a bit about efficient insertion and removal (preferably regardless both of how many objects are in the collection and of how many supertypes the object's type has, though I'm willing to buy that those criteria may be mutually exclusive), and about the amount of memory used.

I've searched for some already invented data structure of this kind, but I haven't found any; and I also haven't been quite able to think of any myself that fits my needs as described above (for example, given the contiguous type IDs, it would be easy to create a direct lookup table from types to objects with O(1) lookup, but that would use far too much memory).

Does anyone know of, or can think of, any data structure of this kind?

Upvotes: 1

Views: 83

Answers (2)

Nuclearman

Reputation: 5304

Tom Dalling's method is fairly close to optimal in terms of memory costs. However, as mentioned, there are algorithms that trade extra space for faster supertype lookups and faster counting of the number of direct/indirect supertypes a type has. Below are a couple of algorithms that do so; it's up to you to determine whether the trade-offs are worth it. In the end, the performance of both algorithms depends largely on what the type graph (the connections between subtypes and supertypes) looks like. If the type graph is fairly sparse or otherwise favorable (the performance-related variables are closer to the lower end of their bounds), then the average (amortized) performance of the algorithms below can make them worth using.

Performance related variables:

  • N is the number of types.
  • D is the average depth (how far down the subtypes go). Bounds O(1) to O(N).
  • M is the value of the highest-numbered ID that is a subtype of a given type. Bounds O(1) to O(N).
  • k is the number of direct supertypes a type has. Bounds O(1) to O(N).
  • K is the average number of total unique supertypes a type has. Bounds O(1) to O(N).
  • L is the average number of total unique subtypes a type has. Bounds O(1) to O(N).
  • E is the number of subtype-supertype connections. Bounds O(N) to O(N^2).

Algorithms:

  1. O(1) supertype lookups with an O(N*D) extra space cost. The idea is to have each type maintain a (dynamic) boolean array of all of its supertypes. The supertype array would have a size equal to the maximum supertype ID number. The array would be built by copying the supertype arrays of each of the inherited (direct) supertypes, then adding the IDs of those supertypes themselves; a sketch appears after this list. The Pythonic check for whether a type has a given supertype would be something like this:

    return len(supertype_array) > supertype_id and supertype_array[supertype_id]
    

    Adding a subtype amounts to doing set unions on the supertype lists of each direct supertype, which is O(k*N).

  2. An alternative method provides superior space performance to #1 when E is relatively close to N, at somewhat higher costs elsewhere. Supertype lookups are O(log N), and adding a subtype again amounts to doing set unions on the supertype lists of each direct supertype, but ends up being linear in the total number of elements across those lists. The idea is to use a bitwise trie of IDs whenever it would take up less space than a boolean array. The benefit is clear if the ID numbers are 10, 20, and 1000: a bitwise trie will require far fewer bits than the 1000 required by the boolean array. However, if the IDs are 1, 2, 3, 4, 5, ..., 100, then the bitwise trie requires at least 573 bits (calculation), while the boolean array only requires 100. It wouldn't be too difficult to keep track of an upper limit on how many bits are in each boolean array or bitwise trie, to determine when a subtype should use a boolean array (when the array would be sufficiently full) and when a bitwise trie should be used, based on the total number of bits in the supertypes. Copying from a trie to a new trie is linear in the number of bits, while copying from a boolean array to a trie is linearithmic in the number of bits. Determining whether a type has a given supertype then requires either a lookup like in #1 (when a boolean array is more space efficient) or a binary search (when the trie is used); a simplified sketch also appears after this list. You could also use something like a y-fast trie if you're inclined to try to implement it, and a bitwise radix trie might increase the space efficiency.
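
A minimal sketch of the boolean-array idea from #1, assuming contiguous type IDs as in the question; the attribute names (`id`, `supertype_array`) are hypothetical:

# Sketch of algorithm 1: each type keeps a boolean array, indexed by type ID,
# marking all of its direct and indirect supertypes.

def build_supertype_array(direct_supertypes):
    # Union the supertype arrays of the direct supertypes, then mark the
    # direct supertypes themselves. O(k*N) for k direct supertypes.
    size = 0
    for sup in direct_supertypes:
        size = max(size, len(sup.supertype_array), sup.id + 1)
    array = [False] * size
    for sup in direct_supertypes:
        for i, flag in enumerate(sup.supertype_array):
            if flag:
                array[i] = True
        array[sup.id] = True
    return array

def has_supertype(supertype_array, supertype_id):
    # O(1) membership test, as in the check above
    return len(supertype_array) > supertype_id and supertype_array[supertype_id]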
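
The bitwise trie in #2 is more involved; as a simplified stand-in that gives the same O(log N) lookup in the sparse case, the sketch below switches between a dense boolean array and a sorted list of supertype IDs searched with binary search. The density threshold and function names are assumptions, not part of the answer above:

import bisect

# Simplified stand-in for algorithm 2: store a type's supertype IDs either as
# a dense boolean array or as a sorted list, whichever is smaller; a bitwise
# trie could compress the sparse case further.

def make_supertype_set(supertype_ids):
    ids = sorted(set(supertype_ids))
    if not ids:
        return ("array", [])
    dense_size = ids[-1] + 1             # slots needed for a boolean array
    if len(ids) * 2 >= dense_size:       # crude density threshold (assumption)
        array = [False] * dense_size
        for i in ids:
            array[i] = True
        return ("array", array)
    return ("sorted", ids)

def contains_supertype(supertype_set, supertype_id):
    kind, data = supertype_set
    if kind == "array":
        return supertype_id < len(data) and data[supertype_id]
    i = bisect.bisect_left(data, supertype_id)   # O(log N) binary search
    return i < len(data) and data[i] == supertype_id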

Insert/delete costs for both algorithms are the same as in Tom Dalling's answer, though a radix trie might be faster or more space efficient. It also wouldn't be difficult to keep a counter of the number of supertypes for each type, but that requires O(N log N) extra space.

Note that the size requirements assume that the minimum number of bits is used to represent each number, in order to minimize space. Lopping off those insignificant bits shouldn't add more than a factor of O(log N) to the time performance.

Upvotes: 0

Tom Dalling

Reputation: 24125

Ok, I'll have a crack at it. If you're worried about memory constraints, then it might not be what you are looking for.

Here is some ruby-ish code:

# hash of all objects by type
#
# hierarchy:
#
#   animal
#     amphibian
#     mammal
#       hominid
#
objects_by_type = {
  animal: [:snake, :fish],
  amphibian: [:frog, :newt],
  mammal: [:whale, :rabbit],
  hominid: [:gorilla, :chimpanzee, :human]
}

# print all objects that are of type `search_type`, or a subtype of `search_type`
def print_objects_of_type(search_type)
  #get a list of all valid types
  all_types = [search_type] + search_type.subtypes

  #print all objects belonging to a type in all_types
  all_types.each do |t|
    objects_by_type[t].each do |obj|
      print obj.to_s + ' '
    end
  end

  print "\n"
end

print_objects_of_type(:animal)
# snake fish frog newt whale rabbit gorilla chimpanzee human

print_objects_of_type(:mammal)
# whale rabbit gorilla chimpanzee human

print_objects_of_type(:amphibian)
# frog newt

This all hinges on a hash where the key is a type, and the value is a list of objects.

Searching for objects of a given type will be better than O(n) because you go directly to the correct objects, without testing incorrect ones. The hash lookup will be O(1), and the rest depends on how fast you can get the list of subtypes for a given type.

For insertion and removal, you should be able to achieve O(1) as long as the lists of objects are linked lists. Insertion and removal will require one hash table lookup (O(1)) and one insert/delete on the linked list (also O(1)).
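
For the O(1) removal to actually hold, whoever removes an object needs a handle to its linked-list node. Here is a minimal sketch of that bookkeeping, in Python to match the question's pseudo-code; the class and method names are hypothetical:

# Sketch: a hash (dict) from type to a doubly linked list of objects.
# Keeping the returned node handle alongside each object makes removal O(1).

class _Node:
    __slots__ = ("obj", "prev", "next")
    def __init__(self, obj):
        self.obj, self.prev, self.next = obj, None, None

class TypeIndex:
    def __init__(self):
        self.heads = {}                  # type -> first _Node in its list

    def insert(self, obj_type, obj):
        node = _Node(obj)
        head = self.heads.get(obj_type)
        if head is not None:
            node.next = head
            head.prev = node
        self.heads[obj_type] = node
        return node                      # caller keeps this handle for removal

    def remove(self, obj_type, node):
        if node.prev is not None:
            node.prev.next = node.next
        else:
            self.heads[obj_type] = node.next
            if node.next is None:
                del self.heads[obj_type]
        if node.next is not None:
            node.next.prev = node.prev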

Now, the only problem is the amount of memory that this approach requires. The number of types affects the hash table memory usage, and the number of objects affects the linked list memory usage. You could replace the linked lists with contiguous memory (like a C++ std::vector), which can get rid of the per-object overhead, but then insertion/removal will not be O(1) any more. You'll just have to calculate the per-type and per-object overhead, multiply it by the expected number of types and objects, and make a decision from there.

All the solutions I can think of require a hash table, so if that has too much memory overhead, then I'm out of ideas.

Upvotes: 2
