Reputation: 659

Implement a function to check if a string/byte array follows utf-8 format

I am trying to solve this interview question.

After given clearly definition of UTF-8 format. ex: 1-byte : 0b0xxxxxxx 2- bytes:.... Asked to write a function to validate whether the input is valid UTF-8. Input will be string/byte array, output should be yes/no.

I have two possible approaches.

First, if the input is a string, since UTF-8 is at most 4 byte, after we remove the first two characters "0b", we can use Integer.parseInt(s) to check if the rest of the string is at the range 0 to 10FFFF. Moreover, it is better to check if the length of the string is a multiple of 8 and if the input string contains all 0s and 1s first. So I will have to go through the string twice and the complexity will be O(n).

Second, if the input is a byte array (we can also use this method if the input is a string), we check if each 1-byte element is in the correct range. If the input is a string, first check the length of the string is a multiple of 8 then check each 8-character substring is in the range.

I know there are couple solutions on how to check a string using Java libraries, but my question is how I should implement the function based on the question.

Thanks a lot.

Upvotes: 16

Answers (5)

Thiago Mata

Reputation: 2959

A small solution for real world UTF-8 compatibility checking:

public static final boolean isUTF8(final byte[] inputBytes) {
    final String converted = new String(inputBytes, StandardCharsets.UTF_8);
    final byte[] outputBytes = converted.getBytes(StandardCharsets.UTF_8);
    return Arrays.equals(inputBytes, outputBytes);
}

You can check the tests results:

@Test
public void testEnconding() {

    byte[] invalidUTF8Bytes1 = new byte[]{(byte)0b10001111, (byte)0b10111111 };
    byte[] invalidUTF8Bytes2 = new byte[]{(byte)0b10101010, (byte)0b00111111 };
    byte[] validUTF8Bytes1 = new byte[]{(byte)0b11001111, (byte)0b10111111 };
    byte[] validUTF8Bytes2 = new byte[]{(byte)0b11101111, (byte)0b10101010, (byte)0b10111111 };

    assertThat(isUTF8(invalidUTF8Bytes1)).isFalse();
    assertThat(isUTF8(invalidUTF8Bytes2)).isFalse();
    assertThat(isUTF8(validUTF8Bytes1)).isTrue();
    assertThat(isUTF8(validUTF8Bytes2)).isTrue();
    assertThat(isUTF8("\u24b6".getBytes(StandardCharsets.UTF_8))).isTrue();
}

Test cases copy from https://codereview.stackexchange.com/questions/59428/validating-utf-8-byte-array

Upvotes: 4

Koray Tugay

Reputation: 23844

public static boolean validUTF8(byte[] input) {
    int i = 0;
    // Check for BOM
    if (input.length >= 3 && (input[0] & 0xFF) == 0xEF
            && (input[1] & 0xFF) == 0xBB & (input[2] & 0xFF) == 0xBF) {
        i = 3;
    }

    int end;
    for (int j = input.length; i < j; ++i) {
        int octet = input[i];
        if ((octet & 0x80) == 0) {
            continue; // ASCII
        }

        // Check for UTF-8 leading byte
        if ((octet & 0xE0) == 0xC0) {
            end = i + 1;
        } else if ((octet & 0xF0) == 0xE0) {
            end = i + 2;
        } else if ((octet & 0xF8) == 0xF0) {
            end = i + 3;
        } else {
            // Java only supports BMP so 3 is max
            return false;
        }

        while (i < end) {
            i++;
            octet = input[i];
            if ((octet & 0xC0) != 0x80) {
                // Not a valid trailing byte
                return false;
            }
        }
    }
    return true;
}

Upvotes: 2

benez

Reputation: 2011

the CharsetDecoder might be what you are looking for:

@Test
public void testUTF8() throws CharacterCodingException {
    // the desired charset
    final Charset UTF8 = Charset.forName("UTF-8");
    // prepare decoder
    final CharsetDecoder decoder = UTF8.newDecoder();
    decoder.onMalformedInput(CodingErrorAction.REPORT);
    decoder.onUnmappableCharacter(CodingErrorAction.REPORT);

    byte[] bytes = new byte[48];
    new Random().nextBytes(bytes);
    ByteBuffer buffer = ByteBuffer.wrap(bytes);
    try {
        decoder.decode(buffer);
        fail("Should not be UTF-8");
    } catch (final CharacterCodingException e) {
        // noop, the test should fail here
    }

    final String string = "hallo welt!";
    bytes = string.getBytes(UTF8);
    buffer = ByteBuffer.wrap(bytes);
    final String result = decoder.decode(buffer).toString();
    assertEquals(string, result);
}

so your function might look like that:

public static boolean checkEncoding(final byte[] bytes, final String encoding) {
    final CharsetDecoder decoder = Charset.forName(encoding).newDecoder();
    decoder.onMalformedInput(CodingErrorAction.REPORT);
    decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
    final ByteBuffer buffer = ByteBuffer.wrap(bytes);

    try {
        decoder.decode(buffer);
        return true;
    } catch (final CharacterCodingException e) {
        return false;
    }
}

Upvotes: 1

Jean-François Savard

Reputation: 21004

Let's first have a look at a visual representation of the UTF-8 design.

enter image description here

Now let's resume what we have to do.

Loop over all character of the string (each character being a byte).
We will need to apply a mask to each byte depending on the codepoint as the x characters represent the actual codepoint. We will use the binary AND operator (&) which copy a bit to the result if it exists in both operands.
The goal of applying a mask is to remove the trailing bits so we compare the actual byte as the first code point. We will do the bitwise operation using 0b1xxxxxxx where 1 will appear "Bytes in sequence" time, and other bits will be 0.
We can then compare with the first byte to verify if it is valid, and also determinate what is the actual byte.
If the character entered in none of the case, it means the byte is invalid and we return "No".
If we can get out of the loop, that means each of the character are valid, hence the string is valid.
Make sure the comparison that returned true correspond to the expected length.

The method would look like this :

public static final boolean isUTF8(final byte[] pText) {

    int expectedLength = 0;

    for (int i = 0; i < pText.length; i++) {
        if ((pText[i] & 0b10000000) == 0b00000000) {
            expectedLength = 1;
        } else if ((pText[i] & 0b11100000) == 0b11000000) {
            expectedLength = 2;
        } else if ((pText[i] & 0b11110000) == 0b11100000) {
            expectedLength = 3;
        } else if ((pText[i] & 0b11111000) == 0b11110000) {
            expectedLength = 4;
        } else if ((pText[i] & 0b11111100) == 0b11111000) {
            expectedLength = 5;
        } else if ((pText[i] & 0b11111110) == 0b11111100) {
            expectedLength = 6;
        } else {
            return false;
        }

        while (--expectedLength > 0) {
            if (++i >= pText.length) {
                return false;
            }
            if ((pText[i] & 0b11000000) != 0b10000000) {
                return false;
            }
        }
    }

    return true;
}

Edit : The actual method is not the original one (almost, but not) and is stolen from here. The original one was not properly working as per @EJP comment.

Upvotes: 16

DoraShine

Reputation: 659

Well, I am grateful for the comments and the answer. First of all, I have to agree that this is "another stupid interview question". It is true that in Java String is already encoded, so it will always be compatible with UTF-8. One way to check it is given a string:

public static boolean isUTF8(String s){
    try{
        byte[]bytes = s.getBytes("UTF-8");
    }catch(UnsupportedEncodingException e){
        e.printStackTrace();
        System.exit(-1);
    }
    return true;
}

However, since all the printable strings are in the unicode form, so I haven't got a chance to get an error.

Second, if given a byte array, it will always be in the range -2^7(0b10000000) to 2^7(0b1111111), so it will always be in a valid UTF-8 range.

My initial understanding to the question was that given a string, say "0b11111111", check if it is a valid UTF-8, I guess I was wrong.

Moreover, Java does provide constructor to convert byte array to string, and if you are interested in the decode method, check here.

One more thing, the above answer would be correct given another language. The only improvement could be:

In November 2003, UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to match the constraints of the UTF-16 character encoding. This removed all 5- and 6-byte sequences, and about half of the 4-byte sequences.

So 4 bytes would be enough.

I am definitely to this, so correct me if I am wrong. Thanks a lot.

Upvotes: -1

Implement a function to check if a string/byte array follows utf-8 format

Answers (5)

Related Questions