Reputation: 834
I am developing a C# application which needs to process approximately 4,000,000 English sentences. All of these sentences are stored in a tree, where each node in the tree is a class with these fields:
class TreeNode
{
    protected string word;
    protected Dictionary<string, TreeNode> children;
}
My problem is that the application uses up all the RAM (I have 2 GB) by the time it reaches the 2,000,000th sentence, so it only manages to process half the sentences before slowing down drastically.
What can I do to try and reduce the memory footprint of the application?
EDIT: Let me explain my application a bit more. I start with approximately 300,000 English sentences, and from each sentence I generate further sub-sentences like this:
Example sentence: "Football is a very popular sport". Sub-sentences I need:
Football is a very popular sport
is a very popular sport
a very popular sport
very popular sport
popular sport
sport
Each sentence is stored in the tree word by word. Considering the example above, I have a TreeNode with the word field = "Football", and its children dictionary contains the TreeNode for the word "is". The child of the "is" node is the "a" node, the child of the "a" node is the "very" node, and so on. I need to store the sentences word by word because I need to be able to find all the sentences starting with a given prefix, for example "Football is".
So basically, for each word in a sentence I create a new sub-sentence, which is how I ultimately end up with 4,000,000 different sentences. Storing the data in a database is not an option, since the app needs to work on the whole structure at once, and it would further slow down the process if I had to keep writing all the data to a database.
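Roughly, the insertion works like the following sketch (simplified for illustration; the Node type here uses public members, unlike my actual TreeNode):
using System.Collections.Generic;

// Simplified node with public members, just for illustration.
class Node
{
    public string Word;
    public Dictionary<string, Node> Children = new Dictionary<string, Node>();
}

static class Trie
{
    // Inserts every suffix ("sub-sentence") of the sentence into the trie.
    public static void AddSentence(Node root, string sentence)
    {
        string[] words = sentence.Split(' ');
        for (int start = 0; start < words.Length; start++)
        {
            Node node = root;
            for (int i = start; i < words.Length; i++)
            {
                Node child;
                if (!node.Children.TryGetValue(words[i], out child))
                {
                    child = new Node { Word = words[i] };
                    node.Children.Add(words[i], child);
                }
                node = child;
            }
        }
    }
}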
Thanks
Upvotes: 9
Views: 4221
Reputation: 76500
Could you map each word to an int? That way you keep one map from string to int that contains each unique English word once, and a tree structure that stores sentences like so:
class TreeNode
{
    protected int word;
    protected Dictionary<int, TreeNode> children;
}

Dictionary<string, int> _AllWords;
Now, the _AllWords collection as it stands is not optimal for looking up a word given its int key. What you probably want here is something like a multi-key list where you can do a fast lookup based on both key and value. CodeProject has an article about that.
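As a minimal sketch of the two-way mapping (my own illustration, not the CodeProject code), you can pair the dictionary with a list so both directions are O(1):
using System.Collections.Generic;

class WordTable
{
    private readonly Dictionary<string, int> _idsByWord =
        new Dictionary<string, int>();
    private readonly List<string> _wordsById = new List<string>();

    // Returns the existing id for a word, or assigns the next free one.
    public int GetOrAdd(string word)
    {
        int id;
        if (!_idsByWord.TryGetValue(word, out id))
        {
            id = _wordsById.Count;
            _idsByWord.Add(word, id);
            _wordsById.Add(word);
        }
        return id;
    }

    // Maps an id back to its word in O(1) via the list.
    public string GetWord(int id)
    {
        return _wordsById[id];
    }
}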
Upvotes: 2
Reputation: 187
Great question, and some great answers. I learned a lot. The StringCache idea deserves some research.
I want to respond to the "I can't use a database because I need it all in memory" point. In many cases, a database is actually the best solution.
Consider what a robust SQL database engine offers (I'm an MSSQL guy), in particular its indexing and dynamic caching.
The dynamic caching could be a huge benefit here. Assuming your corpus consists only of "normal" sentences, the word distribution will not be uniform: the most frequent words will be accessed several orders of magnitude more often than the least frequent. It is also likely that the frequent words will be added to the dictionary very early, and will therefore be stored close together in the database. A good SQL engine will cache the most frequently used blocks in memory, which naturally favors the kind of searches you describe.
A hybrid solution might look like this:
A table with appropriate indexes:
create table myWords (wordKey int identity, word varchar(50))

create unique index iword on myWords(word)       -- used for adds and retrieval
create unique index iwordKey on myWords(wordKey) -- used for mapping keys back to words
A stored procedure for adding/finding words. Stored procedures conveniently return an int.
create procedure addWord (@word varchar(50))
as
begin
    declare @wordKey int, @rows int

    insert myWords (word)
    select @word
    where not exists (select 1 from myWords where word = @word)

    select @wordKey = @@identity, @rows = @@rowcount

    if @rows = 0
    begin
        select @wordKey = wordKey
        from myWords
        where word = @word
    end

    return @wordKey
end
The application adds each word to the database and builds the tree in memory using only the wordKey values.
You could trade a little speed while building the database to further optimize the benefit of caching the most frequent words: add a usage counter column (usageCount int); inserts set it to 1, updates increment it. Even if your corpus grows in the future, the word frequencies are unlikely to change enough to affect efficiency.
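On the application side, a call might look something like this sketch (the connection handling here is my assumption, not part of the schema above):
using System.Data;
using System.Data.SqlClient;

static class WordDb
{
    // Returns the wordKey for a word, adding it to myWords if necessary.
    public static int AddWord(SqlConnection conn, string word)
    {
        using (var cmd = new SqlCommand("addWord", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;
            cmd.Parameters.AddWithValue("@word", word);

            // The procedure's return code arrives via a ReturnValue parameter.
            SqlParameter ret = cmd.Parameters.Add("@ret", SqlDbType.Int);
            ret.Direction = ParameterDirection.ReturnValue;

            cmd.ExecuteNonQuery();
            return (int)ret.Value;
        }
    }
}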
Upvotes: 0
Reputation: 83
To reduce the memory footprint you could look at SequentialDataCache. It reduces the memory usage of the collection by keeping items serialized in a disk-backed cache rather than as live objects on the managed heap (the item type must be marked with [Serializable]). You can even make the collection permanent by passing the deleteOnClose: false parameter.
Sample
using (var c = SequentialDataCache<TreeNode>.Initialize(deleteOnClose: false))
{
    // add items to the collection
    for (int i = 0; i < 1000; i++)
    {
        var treeNode = new TreeNode()
        {
            Word = string.Format("Word{0}", i),
            Children = new Dictionary<string, TreeNode>()
        };
        for (int j = 0; j < 100; j++)
        {
            var child = new TreeNode() { Word = string.Format("Word{0}", j) };
            treeNode.Children.Add(string.Format("key{0}{1}", i, j), child);
        }
        c.Add(treeNode);
    }

    // assert queries
    Assert.AreEqual("Word0", c[0].Word);
    Assert.AreEqual("Word1", c[0].Children["key01"].Word);
    Assert.AreEqual("Word100", c[100].Word);
}
and the TreeNode...
[Serializable]
class TreeNode
{
    private string word;
    private Dictionary<string, TreeNode> children;

    public string Word
    {
        get { return word; }
        set { word = value; }
    }

    public Dictionary<string, TreeNode> Children
    {
        get { return children; }
        set { children = value; }
    }
}
Upvotes: 1
Reputation: 39697
Some points to think about: .NET strings are UTF-16 (two bytes per character), so for English text you can roughly halve the space the characters take by storing each word as a UTF-8 byte[] instead:
class TreeNode
{
    protected byte[] word;
    protected Dictionary<byte[], TreeNode> children;

    public string Word
    {
        get { return Encoding.UTF8.GetString(word); }
        set { word = Encoding.UTF8.GetBytes(value); }
    }

    public TreeNode GetChildByKey(string key)
    {
        TreeNode node;
        if (children.TryGetValue(Encoding.UTF8.GetBytes(key), out node))
        {
            return node;
        }
        return null;
    }
}
[Edit] And I forgot that you also need a custom comparer for the byte[] key:
children = new Dictionary<byte[], TreeNode>(new ByteArrayComparer());
public class ByteArrayComparer : IEqualityComparer<byte[]>
{
    public bool Equals(byte[] x, byte[] y)
    {
        if (x.Length != y.Length)
            return false;
        for (int i = 0; i < x.Length; i++)
        {
            if (x[i] != y[i])
                return false;
        }
        return true;
    }

    public int GetHashCode(byte[] a)
    {
        // Mix in up to the first four bytes; the length guard keeps short
        // words like "a" or "is" from throwing IndexOutOfRangeException.
        int hash = 0;
        for (int i = 0; i < a.Length && i < 4; i++)
            hash |= a[i] << (8 * i);
        return hash;
    }
}
Upvotes: 2
Reputation: 1252
The only way you can significantly reduce memory usage is by not keeping the sentences in memory.
What are you trying to accomplish? Why are you building a tree? If you are counting something, count and discard the strings as you read them in. If you are building a graph (i.e. to analyze relationships between sentences and/or words), try enumerating the sentences and words so each can be uniquely keyed by an id, and use that id in memory instead.
I hope this helps.
Upvotes: 0
Reputation: 62096
It might be overkill for your situation, but you could store your nodes in files on disk and use a B-Tree implementation to maximize I/O performance. This is what most databases use internally, because there is simply too much data to keep in memory.
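To illustrate the direction (this is not a B-Tree, just the underlying idea of nodes living in a file and referring to each other by offset instead of object reference):
using System.Collections.Generic;
using System.IO;
using System.Text;

class DiskNodeWriter
{
    private readonly BinaryWriter _writer;

    public DiskNodeWriter(Stream stream)
    {
        _writer = new BinaryWriter(stream, Encoding.UTF8);
    }

    // Writes one node and returns its offset so a parent can store it.
    // Children must be written before their parent (bottom-up).
    public long Write(string word, IList<long> childOffsets)
    {
        long offset = _writer.BaseStream.Position;
        _writer.Write(word);
        _writer.Write(childOffsets.Count);
        foreach (long child in childOffsets)
            _writer.Write(child);
        return offset;
    }
}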
Upvotes: 1
Reputation: 129
If your requirement is for performance, and you feel you need all the words in memory, then I'd suggest you use a string array to hold all the distinct words, and store indexes into that array in a sorted binary tree.
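A rough sketch of the idea (all names here are illustrative):
using System;
using System.Collections.Generic;

// Every distinct word lives exactly once in a sorted string array;
// nodes refer to words by index rather than holding strings themselves.
class IndexedNode
{
    public int WordIndex;   // position of this node's word in the array
    public SortedDictionary<int, IndexedNode> Children =
        new SortedDictionary<int, IndexedNode>();
}

static class WordLookup
{
    // Binary search over the sorted array; a negative result means absent.
    public static int IndexOf(string[] sortedWords, string word)
    {
        return Array.BinarySearch(sortedWords, word, StringComparer.Ordinal);
    }
}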
Upvotes: 2
Reputation: 1062502
What is it you are using as the key? Where are you getting the data from? If these are words (not full sentences), I wonder whether you have a lot of duplicated keys (different string instances with the same fundamental value), in which case you might benefit from implementing a local interner to re-use the values (and let the transient copies get garbage collected).
public sealed class StringCache {
    private readonly Dictionary<string,string> values
        = new Dictionary<string,string>(StringComparer.Ordinal);
    public string this[string value] {
        get {
            string cached;
            if (!values.TryGetValue(value, out cached)) {
                values.Add(value, value);
                cached = value;
            }
            return cached;
        }
    }
}
Instantiate this while building the tree, and use it whenever you think a value is likely to be duplicated:
StringCache cache = new StringCache(); // re-use this instance while building your tree
...
string s = ... // whatever (from reading your input)
s = cache[s];
Upvotes: 11
Reputation: 25704
The Dictionary type itself can consume a lot of memory. Have you considered using a List<KeyValuePair<string, TreeNode>> instead? The generic List uses a lot less memory per instance than a generic Dictionary.
Of course, the limitation of using a List instead of a Dictionary is that you don't get automatic indexing by strings. This is a clear trade-off between time and space. If the lists are short, it might even be faster than the dictionary (a linear search over ~10 keys is often faster than a hashtable lookup). Even if only most of the lists are short, it could still be a large improvement (e.g. if 95% of the lists have 10 or fewer items and the other 5% have a maximum of maybe 100 items).
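A minimal sketch of that trade-off (type and member names are mine, not from the question):
using System.Collections.Generic;

class SlimTreeNode
{
    public string Word;
    public List<KeyValuePair<string, SlimTreeNode>> Children =
        new List<KeyValuePair<string, SlimTreeNode>>();

    // Linear scan instead of hashing: cheap while child lists stay short.
    public SlimTreeNode FindChild(string key)
    {
        for (int i = 0; i < Children.Count; i++)
            if (Children[i].Key == key)
                return Children[i].Value;
        return null;
    }
}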
You could even use a plain KeyValuePair<string, TreeNode>[] array trimmed to the exact number of children, which uses even less memory than a List<T> (no spare capacity and no wrapper object around the internal array).
Upvotes: 4