Reputation: 55
I have to read a 226 MB text file that looks like this:
0 25
1 1382
2 99
3 3456
4 921
5 1528
6 578
7 122
8 528
9 81
The first number is an index, the second a value. I want to load the file into an array of shorts (8349328 positions), so I wrote this code:
Short[] docsofword = new Short[8349328];
br2 = new BufferedReader(new FileReader("TermOccurrenceinCollection.txt"));
ss = br2.readLine();
while(ss!=null)
{
docsofword[Integer.valueOf(ss.split("\\s+")[0])] = Short.valueOf(ss.split("\\s+")[1]); //[indexTerm] - numOccInCollection
ss = br2.readLine();
}
br2.close();
It turns out that the entire load takes an incredible 4.2 GB of memory. I really don't understand why; I expected an array of about 15 MB. Thanks for any answer.
Upvotes: 1
Views: 96
Reputation: 1213
If you generate the file yourself, consider writing it with ObjectOutputStream instead of as text. That makes it very easy to read back.
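A minimal sketch of that idea, assuming you control how the file is produced. The class name ShortArrayStore and its save/load methods are made up for illustration; the stream classes are the standard java.io ones. The whole short[] is written as one serialized object and read back with a single cast:

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Arrays;

// Hypothetical helper: stores a short[] in binary form instead of as text lines.
final class ShortArrayStore {

    // Writes the whole array in one call; primitive arrays are Serializable.
    static void save(short[] data, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeObject(data);
        }
    }

    // Reads the array back; no per-line parsing is needed.
    static short[] load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            return (short[]) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // A temp file stands in for the real data file.
        File file = File.createTempFile("docsofword", ".bin");
        file.deleteOnExit();

        short[] original = {25, 1382, 99, 3456, 921};
        save(original, file);
        System.out.println(Arrays.toString(load(file)));
    }
}
```

The index column disappears in this format because the position in the array is the index.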
As @Durandal points out, the reading code should also change. Sample code is below.
short[] docsofword = new short[8349328];
br2 = new BufferedReader(new FileReader("TermOccurrenceinCollection.txt"));
ss = br2.readLine();
int strIndex, index;
while(ss!=null)
{
// Find the single space that separates the index from the value.
strIndex = ss.indexOf(' ');
index = Integer.parseInt(ss.substring(0, strIndex));
docsofword[index] = Short.parseShort(ss.substring(strIndex + 1));
ss = br2.readLine();
}
br2.close();
You can optimise even further. Instead of calling indexOf() and then substring(), write your own method that walks the line character by character: build up the index from the digits until you hit the space, then build up the value from the remaining digits. That way no temporary strings are created at all.
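A rough sketch of such a hand-written parser. LineParser and parseInto are hypothetical names, and it assumes every line is exactly "index value" with non-negative decimal numbers and a single space, as in the sample data from the question:

```java
// Hypothetical single-pass parser for lines of the form "index value".
final class LineParser {

    // Converts one character to its digit value, rejecting anything else.
    private static int digit(char c) {
        if (c < '0' || c > '9') {
            throw new NumberFormatException("not a digit: '" + c + "'");
        }
        return c - '0';
    }

    // Parses one line and stores the value at the given index in target.
    // No substrings or wrapper objects are created.
    static void parseInto(String line, short[] target) {
        int n = line.length();
        int i = 0;

        // Digits before the space form the array index.
        int index = 0;
        while (i < n && line.charAt(i) != ' ') {
            index = index * 10 + digit(line.charAt(i));
            i++;
        }
        // Reject lines with no index, no space, or nothing after the space.
        if (i == 0 || i >= n - 1) {
            throw new NumberFormatException("expected \"index value\": " + line);
        }
        i++; // skip the separating space

        // Remaining digits form the value; it must fit into a short.
        int value = 0;
        while (i < n) {
            value = value * 10 + digit(line.charAt(i));
            if (value > Short.MAX_VALUE) {
                throw new NumberFormatException("value out of short range: " + line);
            }
            i++;
        }
        target[index] = (short) value;
    }

    public static void main(String[] args) {
        short[] docsofword = new short[10];
        parseInto("3 3456", docsofword);
        System.out.println(docsofword[3]);
    }
}
```

In the reading loop this would replace the indexOf/substring lines with a single call, parseInto(ss, docsofword).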
Upvotes: 0
Reputation: 20059
There are multiple effects at work here.
First, you declared your array as type Short[] instead of short[]. The former is an array of references, meaning each value is wrapped in an instance of Short and carries the overhead of a full-blown object (most likely 16 bytes instead of two). It also inflates each array slot from two bytes to the reference size (generally 4 or 8 bytes, depending on heap size and 32/64-bit VM). The minimum size you can expect for the fully populated array is thus approximately 8349328 x 20 bytes = 160 MB.
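The arithmetic can be spelled out in a few lines. This is only a back-of-the-envelope sketch: ArrayFootprint is a made-up name, and the 4-byte reference and 16-byte object sizes are the typical figures assumed above, not values measured on any particular JVM. Array headers are ignored.

```java
// Hypothetical estimator for the two array layouts discussed above.
final class ArrayFootprint {

    static final long ENTRIES = 8_349_328L;

    // short[]: two bytes per slot.
    static long primitiveBytes(long entries) {
        return entries * Short.BYTES;
    }

    // Short[]: one reference per slot plus one Short object per value.
    // referenceBytes and objectBytes are assumptions passed in by the caller.
    static long boxedBytes(long entries, int referenceBytes, int objectBytes) {
        return entries * (referenceBytes + objectBytes);
    }

    public static void main(String[] args) {
        System.out.println("short[]: " + primitiveBytes(ENTRIES) + " bytes");
        System.out.println("Short[]: " + boxedBytes(ENTRIES, 4, 16) + " bytes");
    }
}
```

With these assumptions the primitive array needs 16,698,656 bytes (the roughly 15 MB the question expected) and the boxed one 166,986,560 bytes.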
Your reading code is happily producing tons of garbage objects: you are again using a wrapper type (Integer) to address the array where a simple int would do. That's at least 16 bytes of garbage where it would be zero with int. String.split is another culprit: you force the compilation of two regular expressions per line, plus the creation of two strings. That's numerous short-lived objects that become garbage for each line. All of that could be avoided with a few more lines of code.
So you have a relatively memory-hungry array and lots of garbage. The garbage memory can be cleaned up, but the JVM decides when. The decision is based on the maximum heap size available and the garbage collector parameters. If you supplied no arguments for either, the JVM will happily grow the heap up to its default maximum, which can be a large share of your machine's memory, before it makes a serious effort to reclaim garbage.
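To see what limit the JVM is actually working with, you can ask the runtime. A small sketch (HeapReport is a made-up name; the Runtime methods are standard):

```java
// Hypothetical helper that prints the heap limits of the running JVM.
final class HeapReport {

    // Heap currently occupied by objects, live or not yet collected.
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // One-line summary: the maximum the heap may grow to, the amount
    // currently reserved by the JVM, and the amount currently in use.
    static String describe() {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024L * 1024L;
        return "max=" + rt.maxMemory() / mb + "MB"
                + ", committed=" + rt.totalMemory() / mb + "MB"
                + ", used=" + usedBytes() / mb + "MB";
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Calling describe() before and after the load shows how far the heap has grown. Starting the program with an explicit limit such as -Xmx256m caps the maximum, which forces the collector to reclaim the garbage much earlier.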
TL;DR: Inefficient reading code paired with no JVM parameters.
Upvotes: 3