ElyMan

Reputation: 45

Modify content of large file

I have extracted my tables from my database into JSON files, and now I want to read these files and remove all the double quotes in them. It seemed easy, and I have tried hundreds of solutions; some led me to out-of-memory problems. I'm dealing with files that are more than 1 GB in size. The code below has a strange behaviour, and I don't understand why it returns empty files:

    public void replaceDoubleQuotes(String fileName) {
        log.debug(" start formatting " + fileName + " ...");
        File firstFile = new File("C:/sqlite/db/tables/" + fileName);
        String oldContent = "";
        String newContent = "";
        BufferedReader reader = null;
        BufferedWriter writer = null;
        FileWriter writerFile = null;
        String stringQuotes = "\\\\\\\\\"";
        try {
            reader = new BufferedReader(new FileReader(firstFile));
            writerFile = new FileWriter("C:/sqlite/db/tables/" + fileName);
            writer = new BufferedWriter(writerFile);

            while ((oldContent = reader.readLine()) != null) {
                newContent = oldContent.replaceAll(stringQuotes, "");
                writer.write(newContent);
            }

            writer.flush();
            writer.close();
        } catch (Exception e) {
            log.error(e);
        }
    }

And when I try to use FileWriter(path, true) to write at the end of the file, the program never stops growing the file until the hard disk is full. Thanks for your help.

PS: I also tried to use substring and to append the new content, writing it out after the while loop, but that doesn't work either.

Upvotes: 1

Views: 946

Answers (2)

GPI

Reputation: 9328

TL;DR

Do not read and write the same file concurrently.

The issue

Your code starts reading, and then immediately truncates the file it is reading.

    reader = new BufferedReader(new FileReader(firstFile));
    writerFile = new FileWriter("C:/sqlite/db/tables/" + fileName);
    writer = new BufferedWriter(writerFile);

The first line opens a read handle to the file. The second line opens a write handle to the same file. It is not obvious from the documentation of the FileWriter constructors, but when you do not use a constructor that lets you specify the append parameter, it defaults to false, meaning the file is immediately truncated if it already exists.

At this point (the second line above), you have just erased the file you were about to read. So you end up with an empty file.
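
For illustration, here is a minimal, self-contained example of the two constructor behaviours (the path is just a placeholder):

    import java.io.FileWriter;
    import java.io.IOException;

    public class FileWriterModes {
        public static void main(String[] args) throws IOException {
            // Hypothetical path, for illustration only.
            String path = "C:/sqlite/db/tables/table.json";

            // This constructor truncates the file the moment it is opened:
            try (FileWriter truncating = new FileWriter(path)) {
                // the existing contents are already gone at this point
            }

            // Passing append=true keeps the existing contents and appends:
            try (FileWriter appending = new FileWriter(path, true)) {
                appending.write("appended at the end");
            }
        }
    }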

What about using append=true

Well, then the file is not erased when it is created, which is "good". So your program starts reading the first line, and outputs the filtered version to the same file.

So each time a line is read, another is appended.

No wonder your program never reaches the end of the file: each time it advances one line, it appends another line to process. Generally speaking, you will never reach the end of the file (of course, if the file is a single line to begin with you might, but that's a corner case).

The solution

Write to a temporary file, and IF (and only IF) you succeed, then swap the files, if you really need to.

An advantage of this solution: if for whatever reason your process crashes, the original file is untouched and you can retry later, which is usually a good thing. Your process is "repeatable".

A disadvantage: you will need twice the disk space at some point. (Although you could compress the temp file to reduce that factor.)
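
Here is a minimal sketch of that approach (illustrative only: the ".tmp" suffix is arbitrary, Files.newBufferedReader assumes UTF-8 content, and the simple replace call just drops every literal double quote; adapt it to the exact pattern you need):

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.*;

    public void replaceDoubleQuotes(String fileName) throws IOException {
        Path source = Paths.get("C:/sqlite/db/tables/" + fileName);
        Path temp = Paths.get("C:/sqlite/db/tables/" + fileName + ".tmp");
        // try-with-resources closes both handles, even if an exception is thrown
        try (BufferedReader reader = Files.newBufferedReader(source);
             BufferedWriter writer = Files.newBufferedWriter(temp)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line.replace("\"", ""));
                writer.newLine(); // readLine() strips the line break; restore it
            }
        }
        // The swap only happens if the whole copy above succeeded.
        Files.move(temp, source, StandardCopyOption.REPLACE_EXISTING);
    }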

About out of memory issues

When working with arbitrarily large files, the path you chose (using buffered readers and writers) is the right one, because you only hold one line's worth of memory at a time.

Therefore it generally avoids memory issues (unless, of course, you have a file without line breaks, in which case it makes no difference at all).

Other solutions, which involve reading the whole file at once, then performing the search/replace in memory, then writing the contents back, do not scale as well, so it's good that you avoided that kind of computation.

Not related but important

Check out the try-with-resources syntax to properly close your resources (reader/writer). Here you forgot to close the reader, and you are not closing the writer appropriately anyway (that is, in a finally clause).

Another thing: I'm pretty sure no Java program written by a mere mortal will beat tools like sed or awk, which are available on most Unix platforms (and then some). You may want to check whether rolling your own in Java is worth what amounts to a shell one-liner.

Upvotes: 3

Jason

Reputation: 5246

@GPI already provided a great answer on why reading and writing concurrently is causing the issue you're experiencing. It is also worth noting that reading 1 GB of data into the heap at once can definitely cause an OutOfMemoryError if enough heap isn't allocated, which is likely. To solve this problem you could use an InputStream and read chunks of the file at a time, writing to another file until the process is complete, and ultimately replace the existing file with the modified one. With this approach you could even use a ForkJoinTask to help, since it's such a large job.
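
A minimal sketch of that chunked idea, assuming the file is UTF-8 (the '"' byte, 0x22, never occurs inside a UTF-8 multi-byte sequence, so a byte-level filter is safe here) and using an arbitrary 64 KB chunk size:

    import java.io.*;
    import java.nio.file.*;

    public static void stripQuotes(Path source) throws IOException {
        Path temp = Files.createTempFile(source.getParent(), "strip-", ".tmp");
        try (InputStream in = new BufferedInputStream(Files.newInputStream(source));
             OutputStream out = new BufferedOutputStream(Files.newOutputStream(temp))) {
            byte[] buffer = new byte[64 * 1024]; // 64 KB chunks, tune as needed
            int read;
            while ((read = in.read(buffer)) != -1) {
                int kept = 0;
                for (int i = 0; i < read; i++) {
                    if (buffer[i] != '"') {       // drop every double-quote byte
                        buffer[kept++] = buffer[i];
                    }
                }
                out.write(buffer, 0, kept);
            }
        }
        // Replace the original only after the whole pass completed.
        Files.move(temp, source, StandardCopyOption.REPLACE_EXISTING);
    }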

Side note: there may be a better solution than "create new file, write to new file, replace existing, delete new file".

Upvotes: 0
