Reputation: 2557

Datatype suggestions for storage of a "lot" of strings?

I have an application that queries a database very regularly. Each query returns up to millions of strings, the vast majority being repeats. I need to keep all of these records in memory, and I am trying to minimize the footprint.

My current design is to call GetHashCode() on every string and store the hash instead of the string itself.

I then add the string to a Dictionary<hashcode,string> structure. I also keep a second Dictionary<hashcode,count>, whose count is incremented/decremented as entries start or stop using the string.

In each entry's Dispose method, I decrement the counter and remove the string from the dictionary if its usage drops to zero.

So, a few questions:

Is this a fool's errand? Is there some datatype I could be using that would save me a lot of time/effort compared with working with this giant structure?

I want my string table to be thread safe (which it currently isn't). Is using ConcurrentDictionary my best bet?

Thanks in advance.

Upvotes: 1

Answers (3)

Olivier Jacot-Descombes

Reputation: 112402

I don't see the point of getting the hash code and storing the string in a Dictionary<hash,string> as well as storing the count in a separate dictionary. You can use the string itself as key and the dictionary will create and store the hash code automatically (internally). Therefore using only one dictionary Dictionary<string,count> would be fully sufficient. You can also retrieve the strings from the dictionary through dict.Keys.
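As a rough illustration of the single-dictionary idea, here is a minimal sketch. The class name StringCounter and its members are hypothetical, not part of the question's code, and the class is not thread safe as written.

```csharp
using System.Collections.Generic;

// Hypothetical helper: one dictionary maps each distinct string to its usage count.
public sealed class StringCounter
{
    private readonly Dictionary<string, int> _counts = new Dictionary<string, int>();

    // Registers one more use of the string and returns the new count.
    public int Add(string value)
    {
        _counts.TryGetValue(value, out int count); // count is 0 when the key is absent
        _counts[value] = count + 1;
        return count + 1;
    }

    // Releases one use and returns the remaining count.
    // The entry is dropped once nothing uses the string any more.
    public int Release(string value)
    {
        if (!_counts.TryGetValue(value, out int count))
            return 0;

        if (count <= 1)
        {
            _counts.Remove(value);
            return 0;
        }

        _counts[value] = count - 1;
        return count - 1;
    }

    // The distinct strings are simply the dictionary keys.
    public IEnumerable<string> Strings => _counts.Keys;

    public int DistinctCount => _counts.Count;
}
```

The dictionary computes and compares hash codes internally, so the caller never has to touch GetHashCode().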

The hash code of two different strings can be the same. This is called a collision. The Dictionary<TKey,TValue> handles these collisions automatically.

ConcurrentDictionary<TKey,TValue> seems to be appropriate; however, I don't have any experience with it.
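For what it's worth, a counting table on top of ConcurrentDictionary might look roughly like the sketch below. The class name ConcurrentStringCounter is made up for illustration. The decrement path is the delicate part: it retries with TryUpdate, and removes the entry through the ICollection<KeyValuePair<,>> interface so that the removal only succeeds if the count has not changed in the meantime.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical thread-safe variant of a string usage counter.
public sealed class ConcurrentStringCounter
{
    private readonly ConcurrentDictionary<string, int> _counts =
        new ConcurrentDictionary<string, int>();

    // Atomically adds the key with count 1 or increments the existing count.
    public int Add(string value)
    {
        return _counts.AddOrUpdate(value, 1, (key, current) => current + 1);
    }

    // Decrements the count, removing the entry when the last use is released.
    public void Release(string value)
    {
        while (_counts.TryGetValue(value, out int current))
        {
            if (current <= 1)
            {
                // Removes only if the stored count still equals 'current'.
                var entry = new KeyValuePair<string, int>(value, current);
                if (((ICollection<KeyValuePair<string, int>>)_counts).Remove(entry))
                    return;
            }
            else if (_counts.TryUpdate(value, current - 1, current))
            {
                return;
            }
            // Another thread changed the count first; read it again and retry.
        }
    }

    public bool Contains(string value) => _counts.ContainsKey(value);
}
```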

Upvotes: 0

Daniel Earwicker

Reputation: 116674

The main problem with this is that two different strings can have the same hashcode.

It sounds like you're making this more complex than it needs to be. What you need here is string interning:

http://msdn.microsoft.com/en-us/library/system.string.intern.aspx

The CLR already maintains a table of string instances to conserve memory.
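A small sketch of what that looks like in practice. The class InternDemo is only for illustration; string.Intern is the real framework method.

```csharp
using System;

public static class InternDemo
{
    // Returns true when two equal strings built at run time start out as
    // separate objects and end up sharing one instance after interning.
    public static bool SharesInstance()
    {
        // Built from char arrays, so these are two distinct objects with equal content.
        string a = new string(new[] { 'a', 'b', 'c' });
        string b = new string(new[] { 'a', 'b', 'c' });
        bool distinctBefore = !ReferenceEquals(a, b);

        // Intern returns the pooled instance for the given content.
        string internedA = string.Intern(a);
        string internedB = string.Intern(b);

        return distinctBefore && ReferenceEquals(internedA, internedB);
    }
}
```

Keeping only the interned references lets the duplicate copies be collected.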

UPDATE

However... you should bear in mind the warning in the documentation: the interned strings will not be garbage collected until the CLR unloads, i.e. they hang around for the lifetime of your app domain.

But you could implement the same pattern yourself fairly easily:

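using System.Collections.Generic;

// A private intern pool: hands out one shared instance per distinct string value.
class LocalStringInterner
{
    private readonly Dictionary<string, string> _strings = new Dictionary<string, string>();

    public string Intern(string str)
    {
        // Reuse the instance stored earlier for an equal string, if there is one.
        string interned;
        if (_strings.TryGetValue(str, out interned))
            return interned;

        // First occurrence: this instance becomes the canonical copy.
        _strings.Add(str, str);
        return str;
    }
}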

This way, when you don't need that set of strings any more, you can just abandon the LocalStringInterner.

To make it safe to use from multiple threads, you could wrap the body of Intern in a lock(_strings).
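A sketch of that locked variant, under the assumption that a single coarse lock is acceptable for your load. The class name LockedStringInterner is made up to keep it apart from the class above.

```csharp
using System.Collections.Generic;

// Hypothetical thread-safe version of the local interner.
public sealed class LockedStringInterner
{
    private readonly Dictionary<string, string> _strings = new Dictionary<string, string>();

    public string Intern(string str)
    {
        // The dictionary is private, so locking on it cannot clash with outside code.
        lock (_strings)
        {
            string interned;
            if (_strings.TryGetValue(str, out interned))
                return interned;

            _strings.Add(str, str);
            return str;
        }
    }
}
```

The lookup and the insert happen under the same lock, so two threads interning equal strings at the same moment still end up with one shared instance.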

Upvotes: 1

Jürgen Steinblock

Reputation: 31743

Maybe an MD5 hash could help you with that. Collisions are possible in theory but extremely unlikely in practice, and MD5 is supported by most databases (if not, C# will help you with that).

MySQL:

SELECT name, md5(name)
FROM user
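If the database can't do it, the same digest can be computed on the C# side. This is only a sketch: the helper name Md5Helper is made up, and it assumes the strings are encoded as UTF-8, which has to match the encoding the database hashes if the two values are ever compared.

```csharp
using System.Security.Cryptography;
using System.Text;

public static class Md5Helper
{
    // Returns the MD5 digest of the string as 32 lowercase hex characters.
    public static string HexDigest(string value)
    {
        using (MD5 md5 = MD5.Create())
        {
            // Assumption: UTF-8 encoding of the input.
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(value));

            var builder = new StringBuilder(hash.Length * 2);
            foreach (byte b in hash)
                builder.Append(b.ToString("x2"));
            return builder.ToString();
        }
    }
}
```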

That said, I would consider a better database approach.

If you have a unique id per string on the server side this should be an easy task.

Let's say you have a table called string_resources with an auto_increment id column and a varchar value column. I would also add a unique index on value to ensure you don't store a string twice.

|id     | value          |
|1      | Hello          |
|2      | World          |
...
|145789 | Something else |

Now you can just store the int value in your dictionary

md5: 128bit
int32: 32bit  // <-- You Don't Say?
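On the client side this could look roughly like the sketch below. It assumes your query returns the id alongside (or instead of) the value; the class name StringResourceCache and its members are hypothetical.

```csharp
using System.Collections.Generic;

// Hypothetical client-side cache: records keep only the int id,
// and each distinct string is stored once, keyed by that id.
public sealed class StringResourceCache
{
    private readonly Dictionary<int, string> _byId = new Dictionary<int, string>();

    // Remembers the value for an id; later calls for the same id overwrite it.
    public void Store(int id, string value)
    {
        _byId[id] = value;
    }

    // Looks up the string for an id; returns false if it was never stored.
    public bool TryResolve(int id, out string value)
    {
        return _byId.TryGetValue(id, out value);
    }

    public int Count => _byId.Count;
}
```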

Upvotes: 0
