Reputation: 121
I have a Java program that is supposed to read independent serialized objects from a file (no interdependencies between the objects), process them, then write them as independent serialized objects to another file. It looks something like this (forgive any typos, this is hand written, as the program does other stuff, like figure out how it should be processed):
try {
    ObjectOutputStream fileOut = new ObjectOutputStream(new FileOutputStream("outputFile"));
    ObjectInputStream fileIn = new ObjectInputStream(new FileInputStream("inputFile"));
    for (int i = 0; i < numThingsInFile; i++) {
        MyObject thingToProcess = (MyObject) fileIn.readObject();
        thingToProcess.process();
        fileOut.writeObject(thingToProcess);
        fileOut.flush();
    }
    fileIn.close();
    fileOut.close();
} catch (IOException | ClassNotFoundException e1) {
    e1.printStackTrace();
}
The code does the processing correctly. And, as far as I can tell, I should be discarding thingToProcess on every iteration of the loop, so it should get garbage collected at the JVM's leisure.
However, the memory used by the program keeps increasing as it reads more things, until it slows to a crawl. I took a heap dump and looked at it with an analyzer, and it shows the ObjectInputStream fileIn taking up an absurd amount of memory. Specifically, its "entries" array is huge, significantly larger than the file it originated from: the file is 400 kB, but the entries array is over 600 MB just from reading that file. I also have other threads reading other files in the same way, so I am running out of memory. I know I could give Java more memory, but that is a band-aid that doesn't fix the underlying problem, as I want this process to work with larger files containing more objects.
I would prefer not to break up the files more than they already are.
Is there a way to have the ObjectInputStream not store previous entries or clear the previous entries?
I've tried adding a BufferedInputStream and using mark/reset (before I realized the issue was the entries array within ObjectInputStream):
ObjectInputStream fileIn = new ObjectInputStream(new FileInputStream("inputFile"));
I've tried using readUnshared():
MyObject thingToProcess = (MyObject) fileIn.readUnshared();
This improved things and let me run my program, but the entries array still held hundreds of thousands of objects and kept expanding as time went on, which would cause problems with more objects.
I've tried calling fileOut.reset(), but this did not resolve the issue. On the idea that the file may have been formatted strangely, I also added resets to the ObjectOutputStream that wrote the file inputFile.
Upvotes: 1
Views: 254
Reputation: 298499
You cannot clear back-references on an ObjectInputStream, as the ObjectInputStream must be prepared to handle the back-references of the incoming data, as produced by the writing side. That's why the writing side is responsible for calling reset(), to enforce that no back-references may occur after this point.
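For illustration, a minimal sketch of such a producing side, assuming a writer shaped like the loop in your question (the things collection and the file name are placeholders):

// Sketch of a producing side; "things" and the file name are placeholders.
// reset() clears the output stream's handle table, so no back-references
// can span the reset point.
try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("inputFile"))) {
    for (MyObject thing : things) {
        out.writeObject(thing);
        out.reset(); // writes a reset marker; subsequent objects start with a fresh handle table
    }
}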
Note that this data sharing even applies to the class descriptors of the stored instances, so a hypothetical way of resetting the input stream without a matching reset on the output stream would break as soon as you try to read the next instance of MyObject, as it has a back-reference to the previously written MyObject.class.
This also implies that calling reset() can produce significantly bigger files: even if the MyObject instances weren't shared anyway, there might be more shared data than you were aware of, which will become duplicated after reset().
When you call readUnshared(), you enforce that the stream does not store a back-reference, but when it is not paired with a writeUnshared() on the producing side, there is the risk that the writing side wrote a back-reference to that object again, which will produce an exception on the reading side.
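A minimal sketch of that mismatch (UnsharedMismatch is just an illustrative class name): the writer shares an object, so the second write is only a back-reference, and the reader's readUnshared() causes that back-reference to be rejected later.

import java.io.*;

public class UnsharedMismatch {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream os = new ObjectOutputStream(buf)) {
            String shared = "shared";
            os.writeObject(shared); // assigns a handle
            os.writeObject(shared); // writes only a back-reference to that handle
        }
        try (ObjectInputStream is = new ObjectInputStream(
                new ByteArrayInputStream(buf.toByteArray()))) {
            is.readUnshared();      // forbids later back-references to this object
            is.readObject();        // fails here when it hits the back-reference
        } catch (ObjectStreamException e) {
            System.out.println("back-reference rejected: " + e);
        }
    }
}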
The following program demonstrates that using either reset() or readUnshared() has the intended effect of not maintaining references in the ObjectInputStream:
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.HashSet;
import java.util.Set;

public class Main {
    static class MyObject implements Serializable {}

    public static void main(String[] args) throws Exception {
        for(int run = 0; run < 3; run++) {
            boolean withReset = run == 1, readUnshared = run == 2;
            System.out.println(withReset? " ** with reset":
                readUnshared? " ** with readUnshared": " ** without reset");

            // write ten objects, optionally resetting after each one
            ByteArrayOutputStream o = new ByteArrayOutputStream();
            try(ObjectOutputStream os = new ObjectOutputStream(o)) {
                for(int i = 0; i < 10; i++) {
                    os.writeObject(new MyObject());
                    if(withReset) os.reset();
                }
            }
            System.out.println(o.size() + " bytes");

            // read them back, tracking each object with a weak reference
            // to see when it becomes eligible for garbage collection
            ReferenceQueue<MyObject> q = new ReferenceQueue<>();
            Set<WeakReference<MyObject>> refs = new HashSet<>();
            try(ObjectInputStream is = new ObjectInputStream(
                    new ByteArrayInputStream(o.toByteArray()))) {
                for(int i = 0; i < 10; i++) {
                    MyObject t = (MyObject)
                        (readUnshared? is.readUnshared(): is.readObject());
                    System.out.println("read " + t);
                    refs.add(new WeakReference<MyObject>(t, q));
                    t = null;
                    System.gc();
                    for(;;) {
                        Reference<?> r = q.remove(100);
                        if(r == null) break;
                        System.out.println("One MyObject garbage collected");
                        refs.remove(r);
                    }
                }
            }
            System.out.println("ObjectInputStream close");
            System.gc();
            for(;;) {
                Reference<?> r = q.remove(100);
                if(r == null) break;
                System.out.println("One MyObject garbage collected");
                refs.remove(r);
            }
            if(!refs.isEmpty()) {
                System.out.println(refs.size() + " MyObject(s) not collected");
            }
            System.out.println();
        }
    }
}
which will print
** without reset
88 bytes
read Main$MyObject@5010be6
read Main$MyObject@7daf6ecc
read Main$MyObject@238e0d81
read Main$MyObject@377dca04
read Main$MyObject@21b8d17c
read Main$MyObject@5910e440
read Main$MyObject@533ddba
read Main$MyObject@7a07c5b4
read Main$MyObject@3d646c37
read Main$MyObject@5a10411
ObjectInputStream close
One MyObject garbage collected
One MyObject garbage collected
One MyObject garbage collected
One MyObject garbage collected
One MyObject garbage collected
One MyObject garbage collected
One MyObject garbage collected
One MyObject garbage collected
One MyObject garbage collected
One MyObject garbage collected
** with reset
314 bytes
read Main$MyObject@3eb07fd3
read Main$MyObject@69d0a921
One MyObject garbage collected
read Main$MyObject@799f7e29
One MyObject garbage collected
read Main$MyObject@277050dc
One MyObject garbage collected
read Main$MyObject@7aec35a
One MyObject garbage collected
read Main$MyObject@42110406
One MyObject garbage collected
read Main$MyObject@22d8cfe0
One MyObject garbage collected
read Main$MyObject@1de0aca6
One MyObject garbage collected
read Main$MyObject@41906a77
One MyObject garbage collected
read Main$MyObject@5387f9e0
One MyObject garbage collected
ObjectInputStream close
One MyObject garbage collected
** with readUnshared
88 bytes
read Main$MyObject@5b37e0d2
One MyObject garbage collected
read Main$MyObject@5a2e4553
One MyObject garbage collected
read Main$MyObject@6659c656
One MyObject garbage collected
read Main$MyObject@45ff54e6
One MyObject garbage collected
read Main$MyObject@bebdb06
One MyObject garbage collected
read Main$MyObject@45283ce2
One MyObject garbage collected
read Main$MyObject@7591083d
One MyObject garbage collected
read Main$MyObject@736e9adb
One MyObject garbage collected
read Main$MyObject@108c4c35
One MyObject garbage collected
read Main$MyObject@4bf558aa
One MyObject garbage collected
ObjectInputStream close
One interesting point is that readUnshared() will not maintain a reference in the first place, whereas the reset is only performed on the reading side when it encounters the reset marker on the next read operation, so the garbage collection is one object behind compared to the readUnshared() approach. Further, as predicted, the serialized data is much bigger when using reset().
So the best option for your scenario is to use writeUnshared() on the producing side, paired with readUnshared() on the reading side.
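Applied to the loop from your question, that would look roughly like this (a sketch; it assumes the program that produced inputFile is also changed to use writeUnshared()):

try (ObjectInputStream fileIn = new ObjectInputStream(new FileInputStream("inputFile"));
     ObjectOutputStream fileOut = new ObjectOutputStream(new FileOutputStream("outputFile"))) {
    for (int i = 0; i < numThingsInFile; i++) {
        MyObject thingToProcess = (MyObject) fileIn.readUnshared(); // reader keeps no handle
        thingToProcess.process();
        fileOut.writeUnshared(thingToProcess); // writer keeps no handle either
        fileOut.flush();
    }
}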
Upvotes: 1