paul

Reputation: 13471

Replace in big files: Java heap space out of memory

I have a big XML document (250 MB), and one of its tags contains another XML document that I need to process.

The problem is that this inner XML is wrapped in CDATA, and if I try to do a replace/replaceAll:

String xml= fileContent.replace("<![CDATA[", "  ");
String replace = xml.replace("]]>", " ");

I'm getting:

java.lang.OutOfMemoryError: Java heap space

A simple example of the structure:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<a>
    <b>
        <c>
            <![CDATA[<?xml version="1.0" encoding="UTF-8" standalone="yes"?><bigXML>]]>
        </c>
    </b>
</a>

Even using an XML parser like VTD or SAX doesn't help, because I still need to remove the <![CDATA[ wrapper, and what's inside it is the biggest portion of the file.

Allocating more heap memory is not an option, since this runs on a machine where I don't have any control over the JVM.

Any idea how to extract the XML from the <c> tag and also strip the <![CDATA[ wrapper?

UPDATE

I tried making the modification using streams, as discussed below, but I'm still getting OutOfMemoryErrors.

Any idea how to improve the code to avoid the error?

private void readUpdateAndWrite(
    Reader reader,
    String absolutePath
) {
    // Read the content line by line, strip the CDATA markers,
    // and write the result to the target file
    try (BufferedReader bufferedReader = new BufferedReader(reader);
         BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter(absolutePath))) {
        String line = bufferedReader.readLine();
        while (line != null) {
            String replace = line
                .replace("<![CDATA[", " ")
                .replace("]]>", " ");
            bufferedWriter.write(replace);
            line = bufferedReader.readLine();
        }
    } catch (IOException e) {
        logger.error("Error reading or writing file. Caused by {}", getStackTrace(e));
    }
}

I found my problem. The content of the <![CDATA[ section is one single line of 256 MB, so I cannot do any replace on that line without getting the OutOfMemoryError.

How can I break a 256 MB String into smaller lines or chunks? I tried to create another InputStream over the massive String, but it isn't working.

I guess that's because it is embedded XML, so there are no line breaks in it.
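
To be concrete, this is roughly what I mean by creating another InputStream over the massive String (just a sketch; bigCdataLine and the chunk size are placeholders):

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Sketch only: bigCdataLine stands for the single 256 MB line.
private void readCdataAsStream(String bigCdataLine) throws IOException {
    // getBytes() allocates a full byte[] copy of the String, so the data is
    // briefly held twice in memory; wrapping the String in a stream does not
    // reduce the memory needed.
    InputStream inner = new ByteArrayInputStream(
            bigCdataLine.getBytes(StandardCharsets.UTF_8));
    try (BufferedReader innerReader =
             new BufferedReader(new InputStreamReader(inner, StandardCharsets.UTF_8))) {
        char[] chunk = new char[64 * 1024];
        int read;
        while ((read = innerReader.read(chunk)) != -1) {
            // process chunk[0..read) here
        }
    }
}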

Upvotes: 2

Views: 928

Answers (2)

DuncG

Reputation: 15186

The issue is that you don't have enough memory to allocate copies of such a large String. Each call to String.replace builds a new String containing a copy of the content with the replacement applied. If most of the text is inside those tags and fileContent is 250MB, then your double replace allocates 2 x 250MB strings in quick succession.

Allocating more memory would fix this easily, but since you say you cannot do that, try a different way of handling the string: scan for the marker positions and save the matched section to another file. For example:

String cdata = "<![CDATA[";
int start = fileContent.indexOf(cdata);
int end   = fileContent.lastIndexOf("]]>");

Write the stripped section out to another file. This does not instantiate a second 250MB copy of the string in memory, and it should leave you with a file containing the section inside the <c> tag for further processing:

try(var os = Files.newBufferedWriter(bigxml)) {
    os.write(fileContent, start+cdata.length(), end-start-cdata.length());
}

It's not ideal and might fail if there are multiple start/end markers in fileContent.

Upvotes: 1

Manas

Reputation: 11

The OutOfMemoryError happens because the whole file is read into memory as one String. Instead, read the file chunk by chunk, apply your operations to each chunk, and then write the modified chunk to another file. That avoids the out-of-memory error.

You can try using a BufferedReader to read chunk by chunk:

BufferedReader buffer = new BufferedReader(new FileReader(file), bufferSize);
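
Here is a rough sketch of that idea (the chunk size, file paths, and the small carry-over used to catch a marker that happens to be split across two chunks are my own choices, not something from the question):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CdataStripper {

    // Strips "<![CDATA[" and "]]>" while copying the input to the output in
    // fixed-size chunks, so the 250MB file is never held in memory as one String.
    public static void strip(Path inputPath, Path outputPath) throws IOException {
        final String open = "<![CDATA[";
        final String close = "]]>";
        // Keep the last (longest marker - 1) chars back each round, in case a
        // marker is split across two chunks.
        final int carry = open.length() - 1;

        try (BufferedReader in = Files.newBufferedReader(inputPath, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(outputPath, StandardCharsets.UTF_8)) {

            char[] chunk = new char[64 * 1024];
            String leftover = "";
            int read;
            while ((read = in.read(chunk)) != -1) {
                // Prepend the tail kept from the previous chunk, then replace markers.
                String combined = leftover + new String(chunk, 0, read);
                String replaced = combined.replace(open, " ").replace(close, " ");

                int safeEnd = Math.max(0, replaced.length() - carry);
                out.write(replaced, 0, safeEnd);        // write the safe prefix
                leftover = replaced.substring(safeEnd); // re-scan this tail next round
            }
            out.write(leftover); // flush whatever is left at the end of the file
        }
    }
}

Because the chunks come from Reader.read(char[]) rather than readLine(), it does not matter that the CDATA content is one huge line.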

Upvotes: 1
