Reputation: 659
I am trying to solve this interview question.
After given clearly definition of UTF-8 format. ex: 1-byte : 0b0xxxxxxx 2- bytes:.... Asked to write a function to validate whether the input is valid UTF-8. Input will be string/byte array, output should be yes/no.
I have two possible approaches.
First, if the input is a string, since UTF-8 is at most 4 byte, after we remove the first two characters "0b", we can use Integer.parseInt(s) to check if the rest of the string is at the range 0 to 10FFFF. Moreover, it is better to check if the length of the string is a multiple of 8 and if the input string contains all 0s and 1s first. So I will have to go through the string twice and the complexity will be O(n).
Second, if the input is a byte array (we can also use this method if the input is a string), we check if each 1-byte element is in the correct range. If the input is a string, first check the length of the string is a multiple of 8 then check each 8-character substring is in the range.
I know there are couple solutions on how to check a string using Java libraries, but my question is how I should implement the function based on the question.
Thanks a lot.
Upvotes: 16
Views: 28730
Reputation: 2959
A small solution for real world UTF-8 compatibility checking:
public static final boolean isUTF8(final byte[] inputBytes) {
final String converted = new String(inputBytes, StandardCharsets.UTF_8);
final byte[] outputBytes = converted.getBytes(StandardCharsets.UTF_8);
return Arrays.equals(inputBytes, outputBytes);
}
You can check the tests results:
@Test
public void testEnconding() {
byte[] invalidUTF8Bytes1 = new byte[]{(byte)0b10001111, (byte)0b10111111 };
byte[] invalidUTF8Bytes2 = new byte[]{(byte)0b10101010, (byte)0b00111111 };
byte[] validUTF8Bytes1 = new byte[]{(byte)0b11001111, (byte)0b10111111 };
byte[] validUTF8Bytes2 = new byte[]{(byte)0b11101111, (byte)0b10101010, (byte)0b10111111 };
assertThat(isUTF8(invalidUTF8Bytes1)).isFalse();
assertThat(isUTF8(invalidUTF8Bytes2)).isFalse();
assertThat(isUTF8(validUTF8Bytes1)).isTrue();
assertThat(isUTF8(validUTF8Bytes2)).isTrue();
assertThat(isUTF8("\u24b6".getBytes(StandardCharsets.UTF_8))).isTrue();
}
Test cases copy from https://codereview.stackexchange.com/questions/59428/validating-utf-8-byte-array
Upvotes: 4
Reputation: 23780
public static boolean validUTF8(byte[] input) {
int i = 0;
// Check for BOM
if (input.length >= 3 && (input[0] & 0xFF) == 0xEF
&& (input[1] & 0xFF) == 0xBB & (input[2] & 0xFF) == 0xBF) {
i = 3;
}
int end;
for (int j = input.length; i < j; ++i) {
int octet = input[i];
if ((octet & 0x80) == 0) {
continue; // ASCII
}
// Check for UTF-8 leading byte
if ((octet & 0xE0) == 0xC0) {
end = i + 1;
} else if ((octet & 0xF0) == 0xE0) {
end = i + 2;
} else if ((octet & 0xF8) == 0xF0) {
end = i + 3;
} else {
// Java only supports BMP so 3 is max
return false;
}
while (i < end) {
i++;
octet = input[i];
if ((octet & 0xC0) != 0x80) {
// Not a valid trailing byte
return false;
}
}
}
return true;
}
Upvotes: 2
Reputation: 2001
the CharsetDecoder
might be what you are looking for:
@Test
public void testUTF8() throws CharacterCodingException {
// the desired charset
final Charset UTF8 = Charset.forName("UTF-8");
// prepare decoder
final CharsetDecoder decoder = UTF8.newDecoder();
decoder.onMalformedInput(CodingErrorAction.REPORT);
decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
byte[] bytes = new byte[48];
new Random().nextBytes(bytes);
ByteBuffer buffer = ByteBuffer.wrap(bytes);
try {
decoder.decode(buffer);
fail("Should not be UTF-8");
} catch (final CharacterCodingException e) {
// noop, the test should fail here
}
final String string = "hallo welt!";
bytes = string.getBytes(UTF8);
buffer = ByteBuffer.wrap(bytes);
final String result = decoder.decode(buffer).toString();
assertEquals(string, result);
}
so your function might look like that:
public static boolean checkEncoding(final byte[] bytes, final String encoding) {
final CharsetDecoder decoder = Charset.forName(encoding).newDecoder();
decoder.onMalformedInput(CodingErrorAction.REPORT);
decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
final ByteBuffer buffer = ByteBuffer.wrap(bytes);
try {
decoder.decode(buffer);
return true;
} catch (final CharacterCodingException e) {
return false;
}
}
Upvotes: 1
Reputation: 21004
Let's first have a look at a visual representation of the UTF-8 design.
Now let's resume what we have to do.
x
characters represent the actual codepoint. We will use the binary AND operator (&
) which copy a bit to the result if it exists in both operands.0b1xxxxxxx
where 1 will appear "Bytes in sequence" time, and other bits will be 0.The method would look like this :
public static final boolean isUTF8(final byte[] pText) {
int expectedLength = 0;
for (int i = 0; i < pText.length; i++) {
if ((pText[i] & 0b10000000) == 0b00000000) {
expectedLength = 1;
} else if ((pText[i] & 0b11100000) == 0b11000000) {
expectedLength = 2;
} else if ((pText[i] & 0b11110000) == 0b11100000) {
expectedLength = 3;
} else if ((pText[i] & 0b11111000) == 0b11110000) {
expectedLength = 4;
} else if ((pText[i] & 0b11111100) == 0b11111000) {
expectedLength = 5;
} else if ((pText[i] & 0b11111110) == 0b11111100) {
expectedLength = 6;
} else {
return false;
}
while (--expectedLength > 0) {
if (++i >= pText.length) {
return false;
}
if ((pText[i] & 0b11000000) != 0b10000000) {
return false;
}
}
}
return true;
}
Edit : The actual method is not the original one (almost, but not) and is stolen from here. The original one was not properly working as per @EJP comment.
Upvotes: 16
Reputation: 659
Well, I am grateful for the comments and the answer. First of all, I have to agree that this is "another stupid interview question". It is true that in Java String is already encoded, so it will always be compatible with UTF-8. One way to check it is given a string:
public static boolean isUTF8(String s){
try{
byte[]bytes = s.getBytes("UTF-8");
}catch(UnsupportedEncodingException e){
e.printStackTrace();
System.exit(-1);
}
return true;
}
However, since all the printable strings are in the unicode form, so I haven't got a chance to get an error.
Second, if given a byte array, it will always be in the range -2^7(0b10000000) to 2^7(0b1111111), so it will always be in a valid UTF-8 range.
My initial understanding to the question was that given a string, say "0b11111111", check if it is a valid UTF-8, I guess I was wrong.
Moreover, Java does provide constructor to convert byte array to string, and if you are interested in the decode method, check here.
One more thing, the above answer would be correct given another language. The only improvement could be:
In November 2003, UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to match the constraints of the UTF-16 character encoding. This removed all 5- and 6-byte sequences, and about half of the 4-byte sequences.
So 4 bytes would be enough.
I am definitely to this, so correct me if I am wrong. Thanks a lot.
Upvotes: -1