Reputation: 1806
I need a different data structure for my application.
Essentially I have a custom data structure consisting of "nodes". My task is as follows: given a number of distinct nodes (where the number of nodes I get is unknown), retrieve or create a new node. It reminds me very much of a cache for a function with multiple arguments. The only difference is that all arguments and the return value have the same type, and values I returned earlier might be given back to me as input later on.
Example 1: At first I get nodes A and C. Now I have to create a new node (let's name it AC) and return it. When I get the nodes A and C again in the future, I need to be able to very quickly determine whether I have already created the AC node and return it, or, if it has not been created before, create it and return it.
Example 2: When I get the nodes C and A, I have to return/create a different node! I cannot return AC; it has to be a new node (CA). Order is important!
Later in processing it's also possible that I will get nodes I created earlier. For example in the third call to my datastructure it's entirely possible that I receive the nodes "A and AC". Again I have to create a new node "A-AC", cache it and return it.
At first I used several dictionaries of the form Dictionary<Tuple<Node, Node>, Node> (a rough sketch of that approach follows the list below), but this has multiple problems:
- Creating and comparing with tuples turned out to be too slow for my application
- The number of arguments in a tuple is fixed, so I need multiple dictionaries, each with a different key type (2-tuples, 3-tuples, ...)
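For illustration, a rough sketch of that tuple-keyed approach (the Node payload and the PairCache/GetOrCreate names are simplified placeholders, not my real code):

using System;
using System.Collections.Generic;

public class Node { /* application-specific payload */ }

// One dictionary per arity; this one handles the two-argument case.
public class PairCache
{
    private readonly Dictionary<Tuple<Node, Node>, Node> _pairs =
        new Dictionary<Tuple<Node, Node>, Node>();

    public Node GetOrCreate(Node a, Node b)
    {
        var key = Tuple.Create(a, b);      // a fresh tuple is allocated on every call
        Node result;
        if (!_pairs.TryGetValue(key, out result))
        {
            result = new Node();           // the combined node, e.g. "AC"
            _pairs.Add(key, result);
        }
        return result;
    }
}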
I also have a lot of nodes. I am already filtering the incoming data a bit but I will have to process at least 15 million to 20 million different nodes.
A dictionary doesn't seem to cut it; both the lookup cost and the memory consumption are too high.
I can freely modify how the nodes are implemented so maybe there is another trick I can use to directly link many nodes to another one?
How can I solve this problem as efficiently as possible? What datastructure is used for this problem usually?
Upvotes: 1
Views: 473
Reputation: 13803
It seems you have a lot of constraints (time, efficiency, memory footprint). To be honest, I'm not sure where you put the bar for those constraints.
I once created a small data structure that accomplishes something similar to what you want. I think.
public class StackBlock
{
    public string Component { get; set; }
    public MyObject ResultingObject { get; set; }
    public List<StackBlock> Blocks { get; set; }
}
The idea is that you use these to construct a tree that functions as a cache for your already created objects. A small description of what the properties do:
- Component: one step of the chain, e.g. "A" or "C"
- ResultingObject: the cached value for the chain that ends at this block, e.g. the value "AC"
- Blocks: the child blocks for longer chains that continue from here
So if you wanted to store your "AC" object, this would be the structure you save it in:
StackBlock
  Component: "A"
  ResultingObject: null
  Blocks: [
    StackBlock
      Component: "C"
      ResultingObject: MyObject "AC"
      Blocks: [ ... ]
  ]
Edit: Very simply put, item "CGA" will be found in:
StackBlock "C" -> StackBlock "G" -> StackBlock "A" -> ResultingItem
You can keep stacking blocks together to make longer and longer combinations. And when you need to retrieve an object, all you have to do is traverse the tree in order (a sketch of that traversal follows the note below).
Note that for every step, if you cannot find what you are looking for, that means it was not yet created, so you must create the object and then store it in the tree. The next time you request it, it will now be available.
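A get-or-create traversal could look something like this (the StackBlockCache class, the roots list and the factory parameter are just illustrative names, not my old code; it assumes at least one component per request):

using System;
using System.Collections.Generic;
using System.Linq;

public static class StackBlockCache
{
    // Walks (or builds) one tree level per component, in order, then returns the
    // cached object for that exact chain, creating it on the first request.
    public static MyObject GetOrCreate(List<StackBlock> roots,
                                       IList<string> components,
                                       Func<MyObject> factory)
    {
        var level = roots;
        StackBlock current = null;

        foreach (var component in components)
        {
            var next = level.FirstOrDefault(b => b.Component == component);
            if (next == null)
            {
                // Nothing cached along this path yet: add a block and keep drilling down.
                next = new StackBlock { Component = component, Blocks = new List<StackBlock>() };
                level.Add(next);
            }
            current = next;
            level = next.Blocks;
        }

        if (current.ResultingObject == null)
            current.ResultingObject = factory();   // e.g. builds the "AC" or "CGA" object
        return current.ResultingObject;
    }
}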
In your case, "AC" and "CA" are different objects, and the tree allows you to store these in separate locations.
This will also ensure you don't create an object when it was already in memory, because the tree structure only allows for a single place to put a specific element.
I hope this is somewhat in the direction you want?
Note: This is a very old project of mine, before LINQ was introduced. I can only imagine that the resulting code is rather elegant and concise when you use LINQ for traversing the tree. I'll spare you how I used to do it.
Answer to the comment below
If you have item "GA" and item "CGA", they will not be part of the same StackBlock "A". If you've already cached those two objects, the tree will be as follows:
StackBlock C
  -> StackBlock G
    -> StackBlock A
      -> ResultingObject CGA
StackBlock G
  -> StackBlock A
    -> ResultingObject GA
Note: You'd be storing the upper elements in a List. From that list, you find the first element and start drilling down.
One possible point of confusion I'd like to address: you see two StackBlocks "G" and two StackBlocks "A". These are not the same objects. Every StackBlock mentioned in all my examples is a different object (they just happen to have the same letter).
Maybe it'd be easier to understand if you define StackBlock as a struct instead of a class. It'd work just the same, and you'd be prevented from reusing the same StackBlock on different tiers of the tree you're constructing.
ResultingObject should therefore not be a list, because there should only be a single object that we call "CGA". The entire point of the exercise is to prevent a duplicate object being created, so this data structure is specifically tailored to only allow one place to put your cached object.
Maybe it helps if I flesh out the example, so you can see where everything would end up:
StackBlock C
  -> StackBlock G
    -> ResultingObject CG
    -> StackBlock A
      -> ResultingObject CGA
    -> StackBlock B
      -> ResultingObject CGB
StackBlock G
  -> ResultingObject G
  -> StackBlock A
    -> ResultingObject GA
    -> StackBlock X
      -> ResultingObject GAX
      -> StackBlock K
        -> ResultingObject GAXK
Look at the two Stackblocks called "G". One of them is on the top level, so its ResultingObject is simply G. But the other one is a second-level StackBlock. Therefore, its ResultingObject is CG, because you have to account for the entire chain you drilled down.
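For example, with a get-or-create helper like the sketch above (again purely illustrative; BuildCG, BuildCGA and BuildGA stand in for whatever creates your objects), the objects in this tree would be reached as:

var roots = new List<StackBlock>();

// Both calls walk the same "C" -> "G" path; the longer chain just drills one level deeper.
var cg  = StackBlockCache.GetOrCreate(roots, new[] { "C", "G" },      () => BuildCG());
var cga = StackBlockCache.GetOrCreate(roots, new[] { "C", "G", "A" }, () => BuildCGA());

// "GA" starts at a separate top-level "G" block, so it can never collide with "CGA".
var ga  = StackBlockCache.GetOrCreate(roots, new[] { "G", "A" },      () => BuildGA());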
I hope this helped clear it up. It's a simple concept once you understand it, but I'm having some difficulty describing why this works :)
Upvotes: 2
Reputation: 121
This is a perfect example of when to use a digital search tree, also known as a trie. Basically, each node has an array of child nodes plus a terminal marker (the "null node"). As you work your way down the tree, as in example 1, first to A and then to C: if that C node holds a reference to the null node, then you know that node has already been created; if it does not, it hasn't been created yet. I don't think there are any built-in implementations of a trie, but they are not that hard to build. I built one once to store the English dictionary and used it to check whether words existed. If you build it correctly, it takes up next to no space in memory and has access time that is effectively O(1) (a lookup depends only on the length of the key, not on how many entries are stored).
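Not the code I built back then, but a minimal sketch of such a trie could look like this (Node stands for the question's node type; it uses a dictionary of children instead of a fixed array, since node ids here aren't bounded like letters of the alphabet, and GetOrCreate/factory are placeholder names):

using System;
using System.Collections.Generic;

public class TrieNode
{
    // The cached combined node for the chain that ends here; null until first created.
    public Node Value;
    public Dictionary<int, TrieNode> Children = new Dictionary<int, TrieNode>();
}

public class NodeTrie
{
    private readonly TrieNode _root = new TrieNode();

    // Follows (or builds) one trie edge per input node id, in order, then
    // returns the cached combination, creating it on the first request.
    public Node GetOrCreate(IEnumerable<int> nodeIds, Func<Node> factory)
    {
        var current = _root;
        foreach (var id in nodeIds)
        {
            TrieNode child;
            if (!current.Children.TryGetValue(id, out child))
            {
                child = new TrieNode();
                current.Children.Add(id, child);
            }
            current = child;
        }

        if (current.Value == null)
            current.Value = factory();   // e.g. creates the "AC" node on first request
        return current.Value;
    }
}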
Upvotes: 2