Reputation: 13471
I have a large XML document (250 MB) in which one of the tags contains another XML document that I need to process.
The problem is that this inner XML is wrapped in CDATA,
and if I try to do a replace/replaceAll
String xml= fileContent.replace("<![CDATA[", " ");
String replace = xml.replace("]]>", " ");
I get
java.lang.OutOfMemoryError: Java heap space
A simple example of the structure.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<a>
<b>
<c>
<![CDATA[<?xml version="1.0" encoding="UTF-8" standalone="yes"?><bigXML>]]>
</c>
</b>
</a>
Even using an XML parser like VTD or SAX does not help, because I still need to remove the <![CDATA[ marker, and what is inside it is the biggest portion of the file.
Allocating more heap memory is not an option, since this runs on a machine where I don't have any control over the JVM.
Any idea how to extract the XML from the c tag and also strip the <![CDATA[ wrapper?
UPDATE
I tried making the modification using streams, as discussed below, but I'm still getting an OutOfMemoryError.
Any idea how to improve the code to avoid the error?
private void readUpdateAndWrite(
        Reader reader,
        String absolutePath
) {
    // Write the content in file
    try (BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter(absolutePath))) {
        // Read the content from file
        try (BufferedReader bufferedReader = new BufferedReader(reader)) {
            String line = bufferedReader.readLine();
            while (line != null) {
                String replace = line
                        .replace("<![CDATA[", " ")
                        .replace("]]>", " ");
                bufferedWriter.write(replace);
                line = bufferedReader.readLine();
            }
        } catch (IOException e) {
            logger.error("Error writing in file. Caused by {}", getStackTrace(e));
        }
    } catch (IOException e) {
        logger.error("Error reading in file. Caused by {}", getStackTrace(e));
    }
}
I found my problem. The content of the <![CDATA[ section is a single line of 256 MB, so I cannot do any replace on that line without getting the OutOfMemoryError.
How can I break a 256 MB String into new lines? I tried to create another InputStream over the massive String, but it is not working.
I guess this is because it is an embedded XML document and we cannot have multiple lines.
Upvotes: 2
Views: 928
Reputation: 15186
The issue is that you don't have enough memory to allocate copies of such a large String. Each call to String.replace creates a new String containing a copy of the text with the replacement applied. If most of the text is inside those tags and fileContent is 250MB, then your double replace will allocate two strings of roughly 250MB each in short succession.
Allocating more memory would fix this easily, but since you say you cannot do that, try a different way of loading the string and scanning for the content. One way is to find the positions of the markers and save the section between them to another file. For example:
String cdata = "<![CDATA[";
int start = fileContent.indexOf(cdata);
int end = fileContent.lastIndexOf("]]>");
Then write out the stripped section to another file. This does not instantiate a second copy of the 250MB string in memory and should leave you with a file containing the section inside the <c> tag for further processing.
try(var os = Files.newBufferedWriter(bigxml)) {
os.write(fileContent, start+cdata.length(), end-start-cdata.length());
}
It's not ideal and might fail if there are multiple start/end markers in fileContent.
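The two snippets above can be combined into one self-contained sketch that also guards against a missing marker. The class name CdataSlice and the method writeCdataBody are hypothetical names chosen for illustration, and the sketch still assumes the whole document is already held in a single String.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

class CdataSlice {
    private static final String OPEN = "<![CDATA[";
    private static final String CLOSE = "]]>";

    /**
     * Writes the text between the first OPEN marker and the last CLOSE marker.
     * Returns false (and writes nothing) if either marker is missing.
     */
    static boolean writeCdataBody(String content, Writer out) throws IOException {
        int start = content.indexOf(OPEN);
        if (start < 0) {
            return false;
        }
        int bodyStart = start + OPEN.length();
        int end = content.lastIndexOf(CLOSE);
        if (end < bodyStart) {
            return false;
        }
        // Writes a range of the existing String instead of building a stripped copy first.
        out.write(content, bodyStart, end - bodyStart);
        return true;
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        boolean found = writeCdataBody("<c><![CDATA[<bigXML/>]]></c>", out);
        System.out.println(found + " " + out); // prints "true <bigXML/>"
    }
}
```

In real use the Writer would be something like the Files.newBufferedWriter(bigxml) from the snippet above rather than a StringWriter.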
Upvotes: 1
Reputation: 11
The OutOfMemoryError occurs when the whole file is read into memory as a single string. Instead, read the file chunk by chunk, apply your operations to each chunk, and write the modified chunk to another file. That way the whole content is never held in memory at once.
You can try using a buffered reader to read chunk by chunk:
BufferedReader buffer = new BufferedReader(new FileReader(file), size); // size = buffer size in chars
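A minimal sketch of the chunk-by-chunk idea, assuming the goal is the same as in the question (replace each CDATA marker with a single space). It reads fixed-size char arrays rather than lines, so one very long line is never held in memory, and it holds back a short tail between chunks so that a marker split across two chunks is still found. The class name CdataStripper, the method stripMarkers and the chunkSize parameter are hypothetical names, not part of any library.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

class CdataStripper {
    private static final String OPEN = "<![CDATA[";
    private static final String CLOSE = "]]>";

    /** Copies in to out, replacing every OPEN and CLOSE marker with one space. */
    static void stripMarkers(Reader in, Writer out, int chunkSize) throws IOException {
        char[] chunk = new char[chunkSize];
        StringBuilder pending = new StringBuilder();
        // Hold back this many chars: a marker cut by a chunk boundary fits in the tail.
        int keep = Math.max(OPEN.length(), CLOSE.length()) - 1;
        int n;
        while ((n = in.read(chunk)) != -1) {
            pending.append(chunk, 0, n);
            replaceAll(pending, OPEN, " ");
            replaceAll(pending, CLOSE, " ");
            int safe = pending.length() - keep;
            if (safe > 0) {
                out.append(pending, 0, safe); // flush everything except the held-back tail
                pending.delete(0, safe);
            }
        }
        out.append(pending); // flush whatever is left at end of input
    }

    private static void replaceAll(StringBuilder sb, String marker, String with) {
        int i = sb.indexOf(marker);
        while (i >= 0) {
            sb.replace(i, i + marker.length(), with);
            i = sb.indexOf(marker, i + with.length());
        }
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        // A tiny chunk size forces the markers to straddle chunk boundaries.
        stripMarkers(new StringReader("<c><![CDATA[<bigXML/>]]></c>"), out, 4);
        System.out.println(out); // prints "<c> <bigXML/> </c>"
    }
}
```

With a file, the Reader and Writer would typically come from Files.newBufferedReader and Files.newBufferedWriter, and a chunk size of a few thousand chars keeps the working memory small regardless of line length.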
Upvotes: 1