Reputation: 36745
I am looking for a Java data structure for storing a large text (about a million words), such that I can get a word by index (for example, get the 531467 word).
The problem with String[] or ArrayList is that they take too much memory - about 40 bytes per word on my environment.
I thought of using a String[] where each element is a chunk of 10 words, joined by a space. This is much more memory-efficient - about 20 bytes per word; but the access is much slower.
Is there a more efficient way to solve this problem?
Upvotes: 3
Views: 1225
Reputation: 36745
OK, I have experimented with several of your suggestions, and here are my results (I checked (Runtime.getRuntime().totalMemory()-Runtime.getRuntime().freeMemory()) before filling the array, and checked again after filling the array and gc()):
The average word length is about 3 chars, and most chars are non-ASCII so it's probably about 6 bytes. So, it seems that intern is close to the optimum. It makes sense, since it's an array of words, and many of the words appear much more than once.
Upvotes: 1
Reputation: 7100
-XX:+UseCompressedStrings
Use a byte[] for Strings which can be represented as pure ASCII. (Introduced in Java 6 Update 21 Performance Release)
http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
Seems like a interesting article: http://www.javamex.com/tutorials/memory/string_saving_memory.shtml
I hear ropes are quite good in terms of speed in storing large strings, though not sure memory wise. But you might want to check it out. http://ahmadsoft.org/ropes/ http://en.wikipedia.org/wiki/Rope_%28computer_science%29
Upvotes: 1
Reputation: 533530
If you have a mobile device you can use TIntArrayList which would use 4 bytes per int value. If you use one index per word it will one need a couple of MB. You can also use int[]
If you have a PC or server, this is trivial amount of memory. Memory cost about £6 per GB or 1 cent per MB.
Upvotes: 0
Reputation: 28865
Store all the words in a single string:
class WordList {
private final String content;
private final int[] indices;
public WordList(Collection<String> words) {
StringBuilder buf = new StringBuilder();
indices = new int[words.size()];
int currentWordIndex = 0;
int previousPosition = 0;
for (String word : words) {
buf.append(word);
indices[currentWordIndex++] = previousPosition;
previousPosition += word.length();
}
content = buf.toString();
}
public String wordAt(int index) {
if (index == indices.length - 1) return content.substring(indices[index]);
return content.substring(indices[index], indices[index + 1]);
}
public static void main(String... args) {
WordList list = new WordList(Arrays.asList(args));
for (int i = 0; i < args.length; ++i) {
System.out.printf("Word %d: %s%n", i, list.wordAt(i));
}
}
}
Apart from the characters they contain, each word has an overhead of four bytes using this solution (the entry in indices
). Retrieving a word with wordAt
will always allocate a new string; you could avoid this by saving the toString()
of the StringBuilder
rather than the builder itself, although it uses more memory on construction.
Depending on the kind of text, language, and more, you might want a solution that deals with recurring words better (like the one previously proposed).
Upvotes: 2
Reputation: 70324
You could create a datastructure like this:
List<string> wordlist
Dictionary<string, int> tsildrow // for reverse lookup while building the structure
List<int> wordindex
wordlist
will contain a list of all (unique) words,
tsildrow
will give the index of a word in wordlist
and wordindex
will tell you the index in wordlist
of a specific index in your text.
You would operate it in the following fashion:
for word in text:
if not word in tsildrow:
wordlist.append(word)
tsildrow.add(word, wordlist.last_index)
wordindex.append(tsildrow[word])
this fills up your datastructure. Now, to find the word at index 531467:
print wordlist[wordindex[531467]]
you can reproduce the entire text like this:
for index in wordindex:
print wordlist[index] + ' '
except, that you will still have a problem of punctuation etc...
if you won't be adding any more words (i.e. your text is stable), you can delete tsildrow
to free up some memory if this is a concern of yours.
Upvotes: 1
Reputation: 20969
As Jon Skeet already mentioned, 40mb isn't too large.
But you stated that you are storing a text, so there may be many same Strings. For example stop words like "and" and "or".
You can use String.intern()[1]. This will pool your String and returns a reference to an already existing String.
intern() quite slow, so you can replace this by a HashMap that will do the same trick for you.
[1] http://download.oracle.com/javase/6/docs/api/java/lang/String.html#intern%28%29
Upvotes: 3
Reputation: 39406
I would probably consider using a file, with either fixed sized words or some sort of index. FileInputStream with skip can be pretty efficient
Upvotes: 0
Reputation: 604
You could look at using memory mapping the data structure, but performance might be completely horrible.
Upvotes: 2
Reputation: 1500835
One option would be to store byte arrays instead with the text encoded in UTF-8:
byte[][] words = ...;
Then:
public String getWord(int index)
{
return new String(words[index], "UTF-8");
}
This will be smaller in two ways:
I wouldn't really recommend this approach though... again it will be slower on access, as it needs to create a new String
each time. Fundamentally, if you need a million string objects (so you don't want to pay the recreation penalty each time) then you're going to have to use the memory for a million string objects...
Upvotes: 1