John Lexus

Reputation: 3656

Can UTF8 validation be done on a char[], or must it be done at the original byte[]?

I am attempting to validate that the files I am ingesting are all strictly UTF-8 compliant. From my reading, I have concluded that correct validation requires analyzing the original, untampered bytes of the data. If one inspects the string after the fact, one is unlikely to discover whether any characters were non-UTF-8 compliant, because Java will already have attempted to convert them.

I am reading the files normally: I obtain an InputStream from the file, feed it to an InputStreamReader, and then feed that to a BufferedReader. It looks something like:

InputStream is = new FileInputStream(fileLocation);
InputStreamReader isr = new InputStreamReader(is, StandardCharsets.UTF_8);
BufferedReader br = new BufferedReader(isr);

I can extend the BufferedReader class to add some validation for each character it encounters.

The issue is that BufferedReader has a char[], not a byte[], for the buffer. That means the bytes get auto-converted to chars.

So, my question is: can this validation be done at the char[] level located in BufferedReader? Although I am somewhat "enforcing" UTF8 here:

InputStreamReader isr = new InputStreamReader(is, StandardCharsets.UTF_8);

I am seeing characters from non-UTF-8 input (UTF-16, for example) get silently converted, which breaks some systems. I don't know whether the char[] is simply "too late" for this validation. Is it?

Upvotes: 1

Views: 400

Answers (1)

rzwitserloot

Reputation: 103813

Define 'UTF-8 compliant'. There are two situations that you can reasonably call 'invalid'. UTF-8 as a format converts 32-bit numbers into byte sequences, and it can't convert just any number, only a limited set (though every number that could possibly come up in Unicode can be converted).

  • A valid conversion for a non-existing glyph.

Not every one of the 32-bit numbers that UTF-8 can store is actually a valid Unicode code point. However, Unicode expands all the time. What isn't valid today might be valid tomorrow. There is no real way to know this unless you have the entire Unicode table loaded.

  • An invalid sequence

Usually, when you convert bytes to text (char, String, Reader, Writer, StringBuilder, or anything else that is character oriented) and the input contains an invalid byte sequence, you either get an exception or, if the process is in lenient mode, the failure is converted to a character that means 'this was not valid' (U+FFFD, the Unicode replacement character).

If the exception occurred, then you couldn't possibly have a char array (the exception was thrown instead of a char array being returned). If it didn't, you have that replacement character in your characters, so just search for that.
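Both outcomes can be sketched in Java. This is only a rough illustration, not a definitive implementation: the class name Utf8Check and all of its method names are made up for the example. It relies on the fact that a CharsetDecoder can be told to report malformed input instead of replacing it, and that such a decoder can be passed to InputStreamReader in place of a Charset. As far as I know, the InputStreamReader(InputStream, Charset) constructor used in the question is the lenient kind, which would explain why nothing throws there.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative only.
class Utf8Check {

    // A UTF-8 decoder that throws on invalid byte sequences instead of replacing them.
    static CharsetDecoder strictDecoder() {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
    }

    // Strict check on raw bytes: false if any byte sequence is not valid UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            strictDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Strict check on a stream: reads to the end and reports whether decoding failed.
    static boolean streamIsValidUtf8(InputStream in) {
        try (Reader reader = new InputStreamReader(in, strictDecoder())) {
            char[] buffer = new char[8192];
            while (reader.read(buffer) != -1) {
                // Decoded chars are discarded; only a decoding failure matters here.
            }
            return true;
        } catch (CharacterCodingException e) {
            return false; // invalid byte sequence encountered
        } catch (IOException e) {
            throw new UncheckedIOException(e); // unrelated I/O failure
        }
    }

    // Lenient-mode check: look for U+FFFD in text that was already decoded.
    static boolean containsReplacementChar(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\uFFFD') {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] good = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] bad = {(byte) 'a', (byte) 0xFF, (byte) 'b'}; // 0xFF never occurs in UTF-8
        byte[] utf16 = "hi".getBytes(StandardCharsets.UTF_16);

        System.out.println(isValidUtf8(good));
        System.out.println(isValidUtf8(bad));
        System.out.println(isValidUtf8(utf16));
        System.out.println(streamIsValidUtf8(new ByteArrayInputStream(bad)));

        // Lenient decoding keeps going and leaves U+FFFD behind.
        String lenient = new String(bad, StandardCharsets.UTF_8);
        System.out.println(containsReplacementChar(lenient));
    }
}
```

One caveat with the char-level search: a file that legitimately contains U+FFFD, correctly encoded, looks identical after decoding to one that had an invalid sequence replaced. If that distinction matters, the strict decoder on the bytes is the safer route.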

Upvotes: 4
