Reputation: 126
I have two large CSV files which contain data that is required for users of a web application to validate some info. I defined an ArrayList< String[] > and intended to keep the contents of both files in memory so I wouldn't have to read them each time a user logged in and used the application.
I'm getting a java.lang.OutOfMemoryError: Java heap space, though, when initializing the application and trying to read the second file. (It finishes reading the first file just fine, but hangs while reading the second file, and after a while I get that exception.)
The code for reading the files is pretty straightforward:
ArrayList<String[]> tokenizedLines = new ArrayList<String[]>();

public void parseTokensFile() throws Exception {
    // try-with-resources closes the reader even when reading fails, and avoids
    // a NullPointerException in a finally block if the file cannot be opened
    try (BufferedReader bRead = new BufferedReader(new FileReader(this.tokensFile))) {
        String line;
        while ((line = bRead.readLine()) != null) {
            // every tokenized line stays referenced from the list
            tokenizedLines.add(StringUtils.split(line, fieldSeparator));
        }
    } catch (Exception e) {
        // keep the original exception as the cause
        throw new Exception("Error parsing file.", e);
    }
}
I read that Java's split function can use a lot of memory on large amounts of data, because substring keeps a reference to the original string: a substring of some String uses as much memory as the original, even if we only want a few chars. So I wrote a simple split function to try to avoid this:
public String[] split(String inputString, String separator) {
    ArrayList<String> storage = new ArrayList<String>();
    String remainder = new String(inputString);
    int separatorLength = separator.length();
    while (remainder.length() > 0) {
        int nextOccurance = remainder.indexOf(separator);
        if (nextOccurance != -1) {
            storage.add(new String(remainder.substring(0, nextOccurance)));
            remainder = new String(remainder.substring(nextOccurance + separatorLength));
        } else {
            break;
        }
    }
    storage.add(remainder);
    String[] tokenizedFields = storage.toArray(new String[storage.size()]);
    storage = null;
    return tokenizedFields;
}
This gives me the same error though, so I'm wondering whether it's not a memory leak at all, but simply that I can't keep structures with so many objects in memory. One file is about 600,000 lines long, with 5 fields per line, and the other is around 900,000 lines long with about the same number of fields per line.
The full stacktrace is:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at xxx.xxx.xxx.StringUtils.split(StringUtils.java:16)
    at xxx.xxx.xxx.GFTokensFile.parseTokensFile(GFTokensFile.java:36)
So, after the long post (sorry :P), is this a restriction of the amount of memory assigned to my JVM or am I missing something obvious and wasting resources somewhere?
Upvotes: 0
Views: 466
Reputation: 20065
Make sure your heap is comfortably larger than the combined size of both files; the in-memory representation is typically bigger than the files on disk. You can set the max heap size using the JVM option -Xmx.
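To see which limit is actually in effect, a small check like the following can help. This is only a sketch: the class name HeapInfo is made up for illustration, while Runtime.maxMemory() is the standard API for the maximum heap the JVM will attempt to use.

```java
public class HeapInfo {
    /** Returns the maximum heap the JVM will attempt to use, in MiB. */
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Run with e.g. `java -Xmx1024m HeapInfo` to see the effect of -Xmx.
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

If the printed value is close to the combined size of the two files, the default heap is almost certainly too small for this approach.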
Then, if you have that much content, maybe you shouldn't load it entirely into memory. I once had a similar problem and fixed it with an index file that stores the offset of each record in the large file; then I only had to read one line at the right offset.
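The offset idea could look roughly like this. It is a minimal sketch, not the code I actually used: the class and method names (LineIndex, buildIndex, readLineAt) are hypothetical, and the index is kept in a list here rather than written to an index file. Note that RandomAccessFile.readLine() is unbuffered and maps each byte to one char, so it only suits single-byte encodings.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class LineIndex {
    /** Records the byte offset at which each line of the file starts. */
    public static List<Long> buildIndex(String path) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long pos = raf.getFilePointer();
            while (raf.readLine() != null) {
                offsets.add(pos);            // start of the line just read
                pos = raf.getFilePointer();  // start of the next line
            }
        }
        return offsets;
    }

    /** Reads the single line that starts at the given byte offset. */
    public static String readLineAt(String path, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(offset);
            return raf.readLine();
        }
    }
}
```

Only one number per line stays in memory; the line content is read from disk when a user actually needs it.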
Also, there are some strange things in your split method.
String remainder = new String(inputString);
You don't need to protect inputString by copying it. Strings are immutable, so reassigning remainder inside the split method never affects the caller's string.
Upvotes: 0
Reputation: 4992
While I wouldn't recommend actual string interning for what you are doing, how about using the idea behind that technique? You could use a HashSet or HashMap to make sure you only use a single String instance whenever your data contains the same sequence of characters. I mean, there must be some kind of overlap in the data, right?
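A minimal sketch of that idea, assuming Java 8 or later for Map.putIfAbsent. The class name StringPool and the method canonical are made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class StringPool {
    private final Map<String, String> pool = new HashMap<>();

    /** Returns the first instance seen with the same content as s. */
    public String canonical(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    /** Number of distinct strings seen so far. */
    public int size() {
        return pool.size();
    }
}
```

Each field would be passed through canonical(...) before it is stored in the String[]. Whether this helps depends entirely on how many values repeat; with mostly unique fields the map only adds overhead.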
On the other hand, what you might be seeing here could be a bad case of heap fragmentation. I'm not sure how the JVM handles these cases, but in the Microsoft CLR larger objects (especially arrays) will be allocated on a separate heap. Growth strategies, such as those of the ArrayList will create a larger array, then copy over the content of the previous array before releasing the reference to it. The Large Object Heap (LOH) isn't compacted in the CLR, so this growth strategy will leave huge areas of free memory that the ArrayList can no longer use.
I don't know how much of that applies to the Java VM, but you could try building the list using LinkedList first, then dump the list content into an ArrayList or directly into an array. That way the large array of lines would be created only once, without causing any fragmentation.
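Something along these lines, as an untested-in-production sketch. The names ListBuilder and load are hypothetical, and String.split stands in for whatever split function you use:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.regex.Pattern;

public class ListBuilder {
    /** Collects lines in a LinkedList, then copies them once into an ArrayList. */
    public static ArrayList<String[]> load(BufferedReader reader, String separator)
            throws IOException {
        List<String[]> collected = new LinkedList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            // limit -1 keeps trailing empty fields
            collected.add(line.split(Pattern.quote(separator), -1));
        }
        // The copy constructor allocates the backing array once, sized to fit.
        return new ArrayList<>(collected);
    }
}
```

Keep in mind that the linked nodes carry their own per-element overhead, and that both structures are alive for a moment during the copy, so this trades fragmentation for a higher temporary peak.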
Upvotes: 0
Reputation: 106
Try improving your code or leave data processing to a database.
The memory usage is larger than your file sizes, since the code makes redundant copies of the processed data: at any moment there is the data still to be processed, the data already processed, and some partial data. String is immutable, so there is no need to use new String(...) to store the result; split does that copy already.
If you can, delegate the whole data storage and searching to a database. CSV files are easily imported/exported to databases and they do all the hard work.
Upvotes: 1
Reputation: 308998
Your JVM won't get more than 2GB on a 32-bit operating system with 4GB of RAM. That's one upper limit.
The second is the max heap size you specify when you start the JVM. Look at that -Xmx parameter.
The third is the fact of life that you cannot fit X units of anything into a Y sized container where X > Y. You know the size of your files. Try parsing each one individually and seeing what kind of heap they're consuming.
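A rough way to measure that from inside the program is sketched below. HeapProbe is a made-up name, the loop is only a stand-in for parsing one of your files, and System.gc() is just a hint to the JVM, so treat the numbers as estimates.

```java
import java.util.ArrayList;
import java.util.List;

public class HeapProbe {
    /** Heap currently in use, in bytes (approximate). */
    public static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        System.gc(); // a hint only; the JVM may ignore it
        long before = usedHeapBytes();

        // Stand-in for parsing one file: 100,000 rows of 5 short fields.
        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            rows.add(new String[] {"f" + i, "a" + i, "b" + i, "c" + i, "d" + i});
        }

        System.gc();
        long after = usedHeapBytes();
        System.out.println(rows.size() + " rows use roughly "
                + (after - before) / (1024 * 1024) + " MiB");
    }
}
```

Scaling the result up to 1.5 million lines gives a first idea of the heap you would need before reaching for a profiler.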
I'd recommend that you download VisualVM, install all the available plugins, and have it monitor your application while it's running. You'll be able to see the entire heap, perm gen space, GC activity, which objects are taking up the most memory, etc.
Getting data is invaluable for all problems, but especially ones like this. Without it, you're just guessing.
Upvotes: 4
Reputation: 719446
I cannot see a storage leak in the original version of the program.
The scenarios where split and similar methods can leak significant storage are rather limited:
You have to NOT be retaining a reference to the original string that you split.
You need to be retaining references to a subset of the strings produced by the string splitting.
What happens when String.substring() is called is that it creates a new String object that shares the original String's backing array. If the original String then becomes unreachable, the substring String is still holding onto an array of characters that includes characters that are not "in" the substring. This can be a storage leak, depending on how long the substring is kept. (This is how substring() behaves in JDKs before 7u6; later versions copy the characters instead.)
In your example, you are keeping strings that contain all characters apart from the field separator character. There is a good chance that this is actually saving space compared to the space used if each substring were an independent String. Certainly, it is no surprise that your version of split doesn't solve the problem.
I think you need to either increase the heap size, or change your application so that it doesn't need to keep all of the data in memory at the same time.
Upvotes: 2