SnickersAreMyFave
SnickersAreMyFave

Reputation: 5143

Datastructures, C#: ~O(1) lookup with range keys?

I have a dataset. This dataset will serve a lookup table. Given a number, I should be able to lookup a corresponding value for that number.

The dataset (let's say its CSV) has a few caveats though. Instead of:

1,ABC
2,XYZ
3,LMN

The numbers are ranges (- being "through", not minus):

1-3,ABC     // 1, 2, and 3 = ABC
4-8,XYZ     // 4, 5, 6, 7, 8 = XYZ
11-11,LMN   // 11 = LMN

All the numbers are signed ints. No ranges overlap with another ranges. There are some gaps; there are ranges that aren't defined in the dataset (like 9 and 10 in the last snippet above). `

How might I model this dataset in C# so that I have the most-performant lookup while keeping my in-memory footprint low?

The only option I've come up with suffers from overconsumption of memory. Let's say my dataset is:

1-2,ABC
4-6,XYZ

Then I create a Dictionary<int,string>() whose key/values are:

1/ABC
2/ABC
4/XYZ
5/XYZ
6/XYZ

Now I have hash performance-lookup, but tons of wasted space in the hash table.

Any ideas? Maybe just use PLINQ instead and hope for good performance? ;)

Upvotes: 3

Views: 1860

Answers (5)

Matt Mills
Matt Mills

Reputation: 8792

You can create a doubly-indirected lookup:

Dictionary<int, int> keys;
Dictionary<int, string> values;

Then store the data like this:

keys.Add(1, 1);
keys.Add(2, 1);
keys.Add(3, 1);
//...
keys.Add(11, 3);

values.Add(1, "ABC");
//...
values.Add(3, "LMN");

And then look the data up:

return values[keys[3]];  //returns "ABC"

I'm not sure how much memory footprint this will save with trivial strings, but once you get beyond "ABC" it should help.

EDIT

After Dan Tao's comment below, I went back and checked on what he was asking about. The following code:

var abc = "ABC";
var def = "ABC";
Console.WriteLine(ReferenceEquals(abc, def));

will write "True" to the console. Which means that the either the compiler or the runtime (clarification?) is maintaining the reference to "ABC", and assigns it as the value of both variables.

After reading up some more on Interned strings, if you're using string literals to populate the dictionary, or Interning computed strings, it will in fact take more space to implement my suggestion than the original dictionary would have taken. If you're not using Interned strings, then my solution should take less space.

FINAL EDIT

If you're treating your strings correctly, there should be no excess memory usage from the original Dictionary<int, string> because you can assign them to a variable and then assign that reference as the value (or, if you need to, because you can Intern them)

Just make sure your assignment code includes an intermediate variable assignment:

while (thereAreStringsLeftToAssign)
{
    var theString = theStringToAssign;
    foreach (var i in range)
    {
        strings.Add(i, theString);
    }
}

Upvotes: 4

Dan Tao
Dan Tao

Reputation: 128307

As arootbeer has mentioned in his answer, the following code does not create multiple instances of the string "ABC"; rather, it interns a single instance and assigns a reference to that instance to each KeyValuePair<int, string> in dictionary:

var dictionary = new Dictionary<int, string>();
dictionary[0] = "ABC";
dictionary[1] = "ABC";
dictionary[2] = "ABC";

// etc.

OK, so in the case of string literals, you're only using one string instance per range of keys. Is there a scenario where this wouldn't be the case--that is, where you would be using a separate string instance for each key within the range (this is what I assume you're concerned about when you speak of "overconsumption of memory")?

Honestly, I don't think so. There are scenarios where multiple equivalent string instances may be created without the benefit of interning, yes. But I can't imagine these scenarios would affect what you're trying to do here.

My reasoning is this: you want to assign certain values to different ranges of keys, right? So any time you are defining a key-range-value pairing of this sort, you have a single value and several keys. The single part is what leads me to doubt that you'll ever have multiple instances of the same string, unless it is defined as the value for more than one range.

To illustrate: yes, the following code will instantiate two identical strings:

string x = "ABC";

Console.Write("Type 'ABC' and press Enter: ");
string y = Console.ReadLine();

Console.WriteLine(Equals(x, y));
Console.WriteLine(ReferenceEquals(x, y));

The above program, assuming the user follows instructions and types "ABC," outputs True, then False. So you might think, "Ah, so when a string is only provided at run-time, it isn't interned! So this could be where my values could be duplicated!"

But... again: I don't think so. It all comes back to the fact that you are going to be assigning a single value to a range of keys. So let's say your values come from user input; then your code would look something like this:

var dictionary = new Dictionary<int, string>();

int start, count;
GetRange(out start, out count);
string value = GetValue();

foreach (int key in Enumerable.Range(start, count))
{
    // Look, you're using the same string instance to assign
    // to each key... how could it be otherwise?
    dictionary[key] = value;
}

Now, if you were actually thinking more along the lines of what LBushkin mentions in his answer--that you may potentially have huge ranges, making it impractical to define a KeyValuePair<int, string> for each key within that range (e.g., if you have a range of 1-1000000)--then I would agree that you're best off with some sort of data structure that bases its lookup on a binary search. If that's more your scenario, say so and I will be happy to offer more ideas on that front. (Or you could just take a look at the link LBushkin already posted.)

Upvotes: 1

Luke Schafer
Luke Schafer

Reputation: 9265

arootbeer has a good solution, but one you may find confusing to work with.

Another choice is to use a reference type instead of a string, so that you point to the same reference

class StringContainer { 
    public string Value { get; set; }
}

Dictionary<int, StringContainer> values;

var value1 = new StringContainer { Value = "ABC" };
values.Add(1, value1);
values.Add(2, value1);

They will both point to the same instance of StringContainer

EDIT: Thanks for the comments everyone. This method handles value types other than string, so it might be useful for more than the given example. Also, it is my understanding that strings don't always behave in the manner you would expect from reference values, but I could be wrong.

Upvotes: 0

LBushkin
LBushkin

Reputation: 131676

If your dictionary is going to truly store a wide range of key values, an approach that expands all possible ranges into explicit keys will rapidly consume more memory than you likely have available.

You're best option is to use a data structure that supports some variation of binary search (or other O(log N) lookup technique). Here's a link to a generic RangeDictionary for .NET that uses an OrderedList internally, and has O(log N) performance.

Achieving constant-time O(1) lookup requires that you expand all ranges into explicit keys. This requires both a lot of memory, and can actually degrade performance when you need to split or insert a new range. This probably isn't what you want.

Upvotes: 4

Rafe
Rafe

Reputation: 5295

Use a balanced ordered tree (or something similar) mapping start-of-range to end-of-range and data. This will be easy to implement for non-overlapping ranges.

Upvotes: 0

Related Questions