Reputation: 32926
I'm building a million-row spreadsheet, and anything done in that process, multiplied by a million, can add up to a big hit. One of my problems is that when I process a formula in a cell, I have to parse the formula, adjust the references, and then build the formula back up. In the course of this I create 5 to 12 strings (depending on how many tokens the formula contains) that I use once and am then done with.
I'm finding that the garbage collector is taking 70% of the time during this processing and the main objects created and then going out of scope to be collected are these strings.
Are there any approaches to reduce the GC hit? (If this was C++ I would just create a pool of strings to reuse.)
The details:
This is for a reporting program. We read the template, merge in the data to generate the final report, perform processing on that final report, and then write it out to disk. The report is held as a document object which in this case is 99% a single table, with 1 million rows (when all the data is merged), each row having 6 cells, each cell optionally having a formula, a value, and/or a body of formatted text.
In the course of processing this, a ton of strings are created for short-lived use. The case where it's killing me is where the cell formula is adjusted. The template has a formula in a couple of the cells, something like "=A5+A6", which is then adjusted for the location of each row. I parse out the tokens {"A5", "+", "A6"}, adjust each for the row it is now on, then put them all back together in a StringBuilder and call toString() on it to assign the result back to the formula String object in the cell.
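For illustration, here is a minimal sketch of how the adjust step could avoid the intermediate token strings by scanning the formula once and writing straight into a reused StringBuilder. The class name FormulaShifter and the method shiftRows are hypothetical, and the grammar is deliberately simplified (letters followed by digits count as a cell reference, so a function name such as LOG10 or a quoted string literal would be mishandled); it is not the asker's actual parser.

```java
final class FormulaShifter {
    // Reused across calls; only the final toString() allocates a new String.
    // Not thread-safe: use one instance per thread.
    private final StringBuilder out = new StringBuilder(64);

    private static boolean isLetter(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /** Returns the formula with every row number in a cell reference shifted by rowDelta. */
    String shiftRows(String formula, int rowDelta) {
        out.setLength(0); // clear the contents but keep the backing array
        int i = 0;
        int n = formula.length();
        while (i < n) {
            char c = formula.charAt(i);
            if (isLetter(c)) {
                // Copy the run of letters (column letters or a function name).
                int start = i;
                while (i < n && isLetter(formula.charAt(i))) {
                    i++;
                }
                out.append(formula, start, i);
                // Digits directly after the letters are treated as a row number.
                if (i < n && isDigit(formula.charAt(i))) {
                    int row = 0;
                    while (i < n && isDigit(formula.charAt(i))) {
                        row = row * 10 + (formula.charAt(i) - '0');
                        i++;
                    }
                    out.append(row + rowDelta); // append(int) needs no temporary String
                }
            } else {
                // Operators, punctuation and plain numbers are copied unchanged.
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

With this sketch, `new FormulaShifter().shiftRows("=A5+A6", 3)` returns "=A8+A9", and each call allocates the result string only, as long as the formula fits in the builder's existing capacity.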
The difficulty with writing most of the document object to disk is that we do not read the document object, operate on it, and write out a new one. To reduce the memory hit and to handle cases where we need to walk across columns as opposed to rows, we work on the object as is, adjusting it in place.
The issue is when we get low on memory - the whole thing runs super fast until we get to that point. I'm using YourKit to profile and what's killing it is collecting String objects. Passing StringBuilder objects can help a bit but not a lot as I'm then going to be collecting a lot of those (fewer, but still a lot).
Upvotes: 3
Views: 1265
Reputation: 68715
Too many objects in memory mean too much cleanup work for the garbage collector. I believe you can reduce the number of objects created if you use StringBuilder/StringBuffer instead of String. Any time you manipulate a String object, a new object is created, because String is immutable.
But if you use StringBuilder/StringBuffer to manipulate strings, no new String object is created for each modification. StringBuilder/StringBuffer are resized dynamically, but you can limit the resizing by choosing an appropriate initial capacity for your StringBuilder/StringBuffer.
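As a rough sketch of the initial-capacity point, the example below clears and reuses one pre-sized StringBuilder instead of creating a new one per formula. The class BuilderReuse, the helper join, and the capacity of 32 are assumptions made for illustration. Note that toString() still allocates one new String per call, so this removes the intermediate objects but not the final result.

```java
final class BuilderReuse {
    /** Clears sb, appends all tokens, and returns the one new String this call allocates. */
    static String join(StringBuilder sb, String[] tokens) {
        sb.setLength(0); // clear the contents, keep the backing array
        for (String token : tokens) {
            sb.append(token);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 32 is an assumed upper bound on formula length; if it holds, the builder never resizes.
        StringBuilder sb = new StringBuilder(32);
        int capacityBefore = sb.capacity();

        String first = join(sb, new String[] {"=", "A5", "+", "A6"});
        String second = join(sb, new String[] {"=", "B7", "*", "2"});

        System.out.println(first + " " + second);
        System.out.println("capacity unchanged: " + (sb.capacity() == capacityBefore));
    }
}
```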
In short: fewer objects, less work for the garbage collector.
Upvotes: 2
Reputation: 46422
This hit has IMHO nothing to do with processing millions of strings. I've just measured that I can create 6 million strings per second sustained rate and the GC is pretty idle.
The problem seems to be that you're running out of memory. This makes the GC work more frequently and harder to keep the program running.
So do not waste time trying to reduce the allocation rate.
Get more memory or reduce the consumption. Getting more memory is usually the cheapest way. For reducing memory consumption consider:
char takes 2 bytes, which means wasting half the memory (assuming you use mostly ASCII).
Without your program, it's hard to say more.
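To make the 2-bytes-per-char point concrete, here is a minimal sketch of holding cell text as one byte per character. The class AsciiText is hypothetical, and it assumes the text contains only characters up to U+00FF; anything else would be silently replaced during encoding, so the fits check should be run first. On Java 9 and later, compact strings already store Latin-1-only strings this way internally, so a wrapper like this mainly matters on older JVMs.

```java
import java.nio.charset.StandardCharsets;

final class AsciiText {
    private final byte[] bytes; // one byte per character instead of two

    AsciiText(String s) {
        // Assumes fits(s) is true; other characters would be replaced by '?'.
        this.bytes = s.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** True if every character of s can be stored in a single byte. */
    static boolean fits(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    int length() {
        return bytes.length;
    }

    /** Decodes back to a String only when one is actually needed. */
    @Override
    public String toString() {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

The trade-off is that every conversion back to a String allocates again, so this helps most for text that is stored for a long time and read rarely.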
Use -XX:+PrintGCDetails and -XX:+PrintGCTimeStamps. This is what I get, nearly no GC overhead:
10.075: [GC [PSYoungGen: 442272K->896K(425472K)] 442852K->1476K(769024K), 0.0016600 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
10.323: [GC [PSYoungGen: 425344K->928K(409600K)] 425924K->1508K(753152K), 0.0017150 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]
10.558: [GC [PSYoungGen: 409504K->928K(394240K)] 410084K->1508K(737792K), 0.0014760 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
10.791: [GC [PSYoungGen: 394144K->928K(379904K)] 394724K->1508K(723456K), 0.0017070 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
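A measurement like the one above can also be taken from inside the program. The sketch below is an assumed setup, not the benchmark that produced the log: it allocates short strings in a loop and reads the accumulated collection time from the standard GarbageCollectorMXBean interface. The class name GcProbe and the loop size are illustrative. On JDK 9 and later, unified logging (-Xlog:gc*) takes the place of the two flags mentioned above.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcProbe {
    /** Sum of the collection times (in ms) reported so far by all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long gcBefore = totalGcMillis();
        long start = System.nanoTime();

        long checksum = 0; // keeps the JIT from discarding the allocations
        for (int i = 0; i < 6_000_000; i++) {
            String s = "A" + i; // short-lived string, garbage after this iteration
            checksum += s.length();
        }

        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        long gcMs = totalGcMillis() - gcBefore;
        System.out.println("elapsed ms: " + elapsedMs + ", GC ms: " + gcMs
                + ", checksum: " + checksum);
    }
}
```

Running it once with plenty of heap and once with a heap that is nearly full of live data (for example via a small -Xmx) should show whether the GC cost comes from the allocation rate or from memory pressure.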
Upvotes: 3