user2875404

Reputation: 3218

Java: Files.readAllBytes() without changing charset

I have a file that contains some plain text that I want to change. However, most of the file is unreadable for humans.

At first I used UTF-8 as the charset. It found the text I wanted to replace, changed it correctly, and wrote everything to a new file. But I ran into two problems: the new file was almost twice as big as the original, and other applications could no longer read it. I then tried the same with ISO-8859-1, which gave a file size much closer to the original than UTF-8 did, but opening and comparing the files in a plain text editor showed me that ISO-8859-1 also "misinterpreted" some bytes and therefore added bytes to the file. The resulting file could also no longer be opened by the applications that were able to open the original file (an MP4).

What I did is the following:

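    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;

    // path and pathDestination are java.nio.file.Path values defined elsewhere
    try {
        // Decode the whole file, binary parts included, as ISO-8859-1
        String content = new String(Files.readAllBytes(path), StandardCharsets.ISO_8859_1);
        // replace() matches the search text literally; replaceAll() would treat it as a regex
        content = content.replace("\"enabled\": false", "\"enabled\": true");
        // Encode with the same charset and write the result to the new file
        Files.write(pathDestination, content.getBytes(StandardCharsets.ISO_8859_1));
    } catch (IOException e1) {
        e1.printStackTrace();
    }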

I am pretty sure I could "keep" the exact same file if I just made my application read it "byte by byte" without any charsets, but then I also would have to convert all the bytes into blocks of UTF-8 text in order to find and replace the plain text in that file, just before turning it back into byte-wise data again in order to parse all this into the new file. There must be a better solution for this!

Just one example:

!7S€ÇŸ becomes

!/S”Ç— (including the hyphen) and just in case it's being displayed the same I uploaded a screenshot


Upvotes: 0

Views: 7621

Answers (1)

Louis Wasserman

Reputation: 198341

If the file contains only some plain text and most of it is not intended to be read as characters, then you should only be converting the part of the file with plain text to a String. Converting arbitrary non-text bytes to String is really, deeply not a good idea.
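To illustrate why decoding arbitrary bytes is risky, here is a minimal sketch that pushes a few non-text bytes through a decode/encode round trip. The class name CharsetRoundTrip and the sample bytes are made up for this illustration. With UTF-8, byte sequences that are not valid UTF-8 are replaced by the replacement character U+FFFD during decoding, so the original bytes are lost and the output grows when it is encoded again.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the original question or answer.
class CharsetRoundTrip {

    // Decode bytes to a String with the given charset, then encode back to bytes.
    public static byte[] roundTrip(byte[] data, Charset charset) {
        return new String(data, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        // '!' followed by two bytes that are not valid UTF-8 on their own.
        byte[] binary = {0x21, (byte) 0x80, (byte) 0xFF};

        byte[] viaUtf8 = roundTrip(binary, StandardCharsets.UTF_8);
        byte[] viaLatin1 = roundTrip(binary, StandardCharsets.ISO_8859_1);

        System.out.println("original length:   " + binary.length);
        System.out.println("UTF-8 length:      " + viaUtf8.length);
        System.out.println("UTF-8 unchanged:   " + Arrays.equals(binary, viaUtf8));
        System.out.println("ISO-8859-1 length: " + viaLatin1.length);
    }
}
```

Even where a charset happens to map every byte value to some character, the String still has no notion of which parts of the file are text and which are media data, which is the point made above.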

> I am pretty sure I could "keep" the exact same file if I just made my application read it "byte by byte" without any charsets, but then I also would have to convert all the bytes into blocks of UTF-8 text in order to find and replace the plain text in that file, just before turning it back into byte-wise data again in order to parse all this into the new file. There must be a better solution for this!

You should be paying attention to the actual format of the file, then. It is entirely possible that some random chunk of bytes -- video or audio, if the file is MP4 as you said -- just randomly happens to match the text you're looking for. That doesn't mean you should change those bytes.

If you're comfortable accepting that risk, then perhaps you should be converting the search text to bytes and searching for those bytes, rather than converting the bytes you're searching to text. That means you can't use replaceAll, though; you'll have to write your own replacement routine for bytes. That's still likely to be more correct, however.
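One way such a byte-level replacement might look is sketched below. This is not code from the answer: the class ByteReplace and its methods indexOf and replaceAll are hypothetical helpers written for illustration, and the sketch assumes the search text is stored in the file as plain ASCII. It uses a simple linear scan, which should be adequate for a short search pattern.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, written for illustration only.
class ByteReplace {

    // Index of the first occurrence of needle in haystack at or after 'from', or -1 if none.
    public static int indexOf(byte[] haystack, byte[] needle, int from) {
        for (int i = Math.max(from, 0); i <= haystack.length - needle.length; i++) {
            int j = 0;
            while (j < needle.length && haystack[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                return i;
            }
        }
        return -1;
    }

    // Returns a copy of data with every occurrence of search replaced by replacement.
    public static byte[] replaceAll(byte[] data, byte[] search, byte[] replacement) {
        if (search.length == 0) {
            throw new IllegalArgumentException("search must not be empty");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream(data.length);
        int pos = 0;
        int hit;
        while ((hit = indexOf(data, search, pos)) >= 0) {
            out.write(data, pos, hit - pos);               // bytes before the match, untouched
            out.write(replacement, 0, replacement.length); // the replacement bytes
            pos = hit + search.length;                     // continue after the match
        }
        out.write(data, pos, data.length - pos);           // remaining tail, untouched
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // In-memory stand-in for a file: binary bytes around a small piece of text.
        byte[] text = "\"enabled\": false".getBytes(StandardCharsets.US_ASCII);
        byte[] data = new byte[text.length + 2];
        data[0] = (byte) 0x80;
        System.arraycopy(text, 0, data, 1, text.length);
        data[data.length - 1] = (byte) 0xFF;

        byte[] search = "\"enabled\": false".getBytes(StandardCharsets.US_ASCII);
        byte[] replacement = "\"enabled\": true".getBytes(StandardCharsets.US_ASCII);
        byte[] result = replaceAll(data, search, replacement);

        System.out.println("before: " + data.length + " bytes, after: " + result.length + " bytes");
    }
}
```

For a real file, the input could come from Files.readAllBytes(path) and the result could be written with Files.write(pathDestination, result). Note that "false" and "true" differ in length, so each replacement changes the file size by one byte; a container format such as MP4 stores sizes and offsets, so a length-changing edit may still leave the file unreadable unless those fields are updated too.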

Upvotes: 2
