Remo.D

Reputation: 16512

Check for valid UTF-8 encoding in C

Say you have a UTF-8 encoded string s. You extract the first bytes that appear to be a UTF-8 encoded codepoint and put them into a 32-bit integer c.

For example, if s starts with the two bytes 0xC3 0xA8 (the encoding of 'è', U+00E8), then c = 0xC3A8.

The problem is to check if c is a valid encoding or not.
The function I wrote is the one below, but I'm not sure it is correct (I might be missing some edge cases).

I know I could go byte by byte following the FSM specified by the standard, but what I need to check is whether this whole-codepoint approach is also correct.

#include <stdint.h>

// c holds the bytes of a single candidate UTF-8 encoded codepoint,
// e.g. 0xC3A8 for 'è'. Returns 1 if it is a valid encoding, 0 otherwise.
int chr_isvalid(uint32_t c)
{
  if (c <= 0x7F) return 1;
  if (0xC080 == c) return 1;   // Accept 0xC080 as representation for '\0'
  if (0xC280 <= c && c <= 0xDFBF) return ((c & 0xE0C0) == 0xC080);
  if (0xEDA080 <= c && c <= 0xEDBFBF) return 0; // Reject UTF-16 surrogates
  if (0xE0A080 <= c && c <= 0xEFBFBF) return ((c & 0xF0C0C0) == 0xE08080);
  if (0xF0908080 <= c && c <= 0xF48FBFBF) return ((c & 0xF8C0C0C0) == 0xF0808080);
  return 0;
}
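
To show the intended usage, here are a few spot checks I ran (compiled together with the function above). The expected results in the comments reflect my own understanding, which is exactly what I'd like to have verified:

#include <assert.h>

int main(void)
{
  assert(chr_isvalid(0x41)       == 1);  // 'A', plain ASCII
  assert(chr_isvalid(0xC3A8)     == 1);  // 'è' (U+00E8)
  assert(chr_isvalid(0xE282AC)   == 1);  // '€' (U+20AC)
  assert(chr_isvalid(0xF09F9880) == 1);  // U+1F600
  assert(chr_isvalid(0xC181)     == 0);  // overlong encoding of 'A'
  assert(chr_isvalid(0xEDA080)   == 0);  // UTF-16 surrogate U+D800
  assert(chr_isvalid(0xF4908080) == 0);  // U+110000, first value beyond U+10FFFF
  return 0;
}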

CLARIFICATIONS:

To better pinpoint my mistakes, please provide an example of a valid encoded codepoint that this code rejects or an invalid encoding that this code accepts as valid.

Upvotes: 4

Views: 5589

Answers (4)

Remo.D

Reputation: 16512

Self-Response

To show why I believe this is correct, I'll summarize here my reasoning. Please point out anything that I might have missed.

I will try to show that:

  1. All valid encodings are accepted (easier).
  2. All invalid encodings are rejected (trickier).

This is the code for reference:

1:  if (c <= 0x7F) return 1;
2:  if (0xC080 == c) return 1;   // Accept 0xC080 as representation for '\0'
3:  if (0xC280 <= c && c <= 0xDFBF) return ((c & 0xE0C0) == 0xC080);
4:  if (0xEDA080 <= c && c <= 0xEDBFBF) return 0; // Reject UTF-16 surrogates
5:  if (0xE0A080 <= c && c <= 0xEFBFBF) return ((c & 0xF0C0C0) == 0xE08080);
6:  if (0xF0908080 <= c && c <= 0xF48FBFBF) return ((c & 0xF8C0C0C0) == 0xF0808080);
7:  return 0;

1) All valid encodings are accepted

Breaking down by the number of encoding bytes, I'll show that the valid encodings for the range U+000000 - U+10FFFF are accepted.

1a) 1-byte (U+0000 - U+007F)

Valid ASCII encodings (ranging from 0x00 to 0x7F) are accepted by line 1.

1b) 2-bytes (U+0080 - U+07FF)

The correct encoding for U+0080 is 0xC280 and for U+07FF it is 0xDFBF; all in-between codepoints fall within this range due to the UTF-8 encoding properties.
This range is checked in line 3.
A valid encoding in this range must be of the form 110xxxxx 10xxxxxx, meaning that masking out the x bits we must get:

110xxxxx 10xxxxxx  &
11100000 11000000      <-- 0xE0C0
-------- --------
11000000 10000000      <-- 0xC080

Hence, all valid 2-byte encodings are accepted by line 3.

Line 2 handles the special case of Modified UTF-8, which encodes U+0000 as 0xC080 (see https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8 ).

1c) 3-bytes (U+0800 - U+FFFF)

The correct encoding for U+0800 is 0xE0A080 and for U+FFFF it is 0xEFBFBF; all in-between codepoints fall within this range due to the UTF-8 encoding properties.
This range is checked in line 5.
A valid encoding in this range must be of the form 1110xxxx 10xxxxxx 10xxxxxx, meaning that masking out the x bits we must get:

1110xxxx 10xxxxxx 10xxxxxx  &
11110000 11000000 11000000     <-- 0xF0C0C0
-------- -------- --------
11100000 10000000 10000000     <-- 0xE08080

Hence, all valid 3-byte encodings are accepted by line 5.

1d) 4-bytes (U+010000 - U+10FFFF)

The correct encoding for U+010000 is 0xF0908080 and for U+10FFFF it is 0xF48FBFBF; all in-between codepoints fall within this range due to the UTF-8 encoding properties.
This range is checked in line 6.
A valid encoding in this range must be of the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, meaning that masking out the x bits we must get:

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  &
11111000 11000000 11000000 11000000     <-- 0xF8C0C0C0
-------- -------- -------- --------
11110000 10000000 10000000 10000000     <-- 0xF0808080

Hence, all valid 4-byte encodings are accepted by line 6.
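
As an extra sanity check (not part of the argument, just how I convinced myself), the boundary encodings of each length can be fed directly to the function; they should all be accepted:

#include <assert.h>
#include <stdint.h>

// Lowest and highest valid encoding for each length (see 1a-1d above).
static const uint32_t boundary[8] = {
  0x00,       0x7F,        // 1 byte:  U+0000   and U+007F
  0xC280,     0xDFBF,      // 2 bytes: U+0080   and U+07FF
  0xE0A080,   0xEFBFBF,    // 3 bytes: U+0800   and U+FFFF
  0xF0908080, 0xF48FBFBF   // 4 bytes: U+010000 and U+10FFFF
};

void check_boundaries(void)
{
  for (int i = 0; i < 8; i++)
    assert(chr_isvalid(boundary[i]) == 1);
}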

2) All invalid encodings are rejected

This is trickier. I'll break the invalid encodings down by type of invalidity.

2a) Non-ASCII single-byte values (0x80 - 0xFF)

This includes:

  • possible stray continuation byte (0x80-0xBF)
  • invalid start byte (0xC0-0xC1, 0xF5-0xFF)
  • valid starting byte (0xC2-0xF4) not followed by a continuation byte

None of these values fall in a range accepted by lines 1-6, so line 7 will reject them.

2b) Missing continuation bytes

The case of no continuation bytes at all is covered in 2a.
If a supposedly 3-byte encoding is missing one continuation byte, the candidate value is in the range 0xE000-0xEFFF, which is not accepted by any of lines 1-6 and, hence, is rejected.

If a supposedly 4-byte encoding is missing two continuation bytes, the candidate value is in the range 0xF000-0xFFFF, which is not accepted by any of lines 1-6 and, hence, is rejected.

If a supposedly 4-byte encoding is missing one continuation byte, the candidate value is in the range 0xF00000-0xFFFFFF, which is not accepted by any of lines 1-6 and, hence, is rejected.

2c) Invalid "continuation" byte

If one of the continuation bytes is outside the valid range (0x80-0xBF), it will be rejected by the masking operation in lines 3, 5 and 6.

For example, in 0xC36A (which is in the range accepted by line 3) the value 0x6A is not a valid continuation byte. It will be rejected because:

11000011 01101010  &   <-- 0xC36A
11100000 11000000      <-- 0xE0C0
-------- --------
11000000 01000000      <-- 0xC040 (expected 0xC080)

Similarly, for 0xE3DE82 (which is in the range accepted by line 5) the value 0xDE is not a valid continuation byte. It will be rejected because:

11100011 11011110 10000010  &  <-- 0xE3DE82
11110000 11000000 11000000     <-- 0xF0C0C0
-------- -------- --------
11100000 11000000 10000000     <-- 0xE0C080 (expected 0xE08080)

Any byte outside 0x80-0xBF, when masked with 0xC0, gives a value different from 0x80, so the encoding is rejected.
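
This property is small enough to be verified exhaustively with a throwaway loop (just a sketch, assuming <assert.h> is included):

// Check that (b & 0xC0) == 0x80 holds exactly for the bytes in 0x80-0xBF,
// i.e. the mask singles out valid continuation bytes and nothing else.
void check_continuation_mask(void)
{
  for (unsigned b = 0x00; b <= 0xFF; b++) {
    int is_continuation = (b >= 0x80 && b <= 0xBF);
    assert(((b & 0xC0) == 0x80) == is_continuation);
  }
}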

2d) UTF-16 surrogates

Their encodings are explicitly rejected by line 4.

2e) Overlong encodings

To create an overlong (invalid) encoding, the codepoint is extended to the left with 0s and then the encoding for the corresponding number of bits is used.

For example, let's say we want to create a 2-byte encoding for 'A' (U+41).
We consider the codepoint to be on 11 bits (named a to k below, from the least to the most significant one) and use the encoding rules for 2 bytes:

 |----------| 11 bits
 kji hgfedcba -> 110kjihg 10fedcba
 000 01000001 -> 11000001 10000001   (U+41 -> 0xC181)

but since the bits from k down to h are 0, the resulting code will always be lower than 0xC280 and, hence, not in any range accepted by lines 1-6 (the only exception being 0xC080, which line 2 deliberately accepts as the Modified UTF-8 encoding of U+0000).

As another example, let's build an overlong 3-byte encoding for the letter 'è' (U+E8):

   |---------------| 16 bits
   ponmlkji hgfedcba -> 1110ponm 10lkjihg 10fedcba
   00000000 11101000 -> 11100000 10000011 10101000   (U+E8 -> 0xE083A8)

This has the bits from p down to l equal to 0 and, hence, falls outside the accepted range (it is lower than 0xE0A080, the minimum 3-byte encoding).

In other words: any overlong encoding is rejected because it is lower than the minimum encoding value accepted for its length by lines 1-6 (again with 0xC080 as the single, intentional, exception).
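
To double-check this on a whole class of cases, the overlong 2-byte encoding of every ASCII codepoint can be built mechanically and fed to the function (a sketch, compiled together with chr_isvalid and <assert.h> as above):

// Overlong 2-byte encodings of U+0000..U+007F must all be rejected,
// except 0xC080 which is deliberately accepted as the Modified UTF-8 NUL.
void check_overlong_2byte(void)
{
  for (uint32_t cp = 0x00; cp <= 0x7F; cp++) {
    uint32_t overlong = ((0xC0 | (cp >> 6)) << 8)  // 1100000x lead byte
                      | (0x80 | (cp & 0x3F));      // 10xxxxxx continuation byte
    assert(chr_isvalid(overlong) == (overlong == 0xC080));
  }
}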

2f) Codepoints above U+10FFFF

The 4-byte sequences that would encode them are greater than 0xF48FBFBF and, hence, not in any accepted range.

Upvotes: 4

Luis Colorado

Reputation: 12645

To decode a UTF-8 byte sequence into code points, a prefix of 1 bits followed by a 0 bit is used in the first byte of the sequence:

0xxxxxxx  <-- code point is 7 bits only, and can be in the range \u0000 to \u007f
110xxxxx  10xxxxxx  <-- code point is 11 bits only, and can be in the range \u0080 to \u07ff
1110xxxx  10xxxxxx 10xxxxxx  <-- code point is 16 bits and can be in the range \u0800 to \uffff
11110xxx  10xxxxxx 10xxxxxx 10xxxxxx <-- code point is 21 bits and can be in the range \U00010000 to \U0010ffff

Note that it is illegal to encode a UTF-8 sequence with more bytes than the minimum necessary. So encoding \u0000 as

11000000 10000000

is illegal, and it should be encoded as

00000000

The decoding of 0xc3 0xa8 into the Unicode codepoint 0xc3a8 is erroneous, because you first have to eliminate the extra bits added for encoding:

   c3       a8
11000011 10101000
   vvvvv   vvvvvv  these are the bits taken to form the code point.
   |||||   ||||||
   00011   101000
   |||||  //////
   ||||| //////
   vvvvvvvvvvv
   00011101000     this is the decoded codepoint:
     0   e   8     0xe8

And its corresponding codepoint is \u00e8.
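
A minimal sketch of that bit stripping for a 2-byte sequence (my own illustration, the function name is arbitrary):

#include <stdint.h>
#include <stdio.h>

// Decode a 2-byte UTF-8 sequence into a code point: drop the 110..... and
// 10...... marker bits and concatenate the 5 + 6 remaining payload bits.
uint32_t decode2(uint8_t lead, uint8_t cont)
{
    return ((uint32_t)(lead & 0x1F) << 6) | (cont & 0x3F);
}

int main(void)
{
    printf("U+%04X\n", decode2(0xC3, 0xA8));   /* prints U+00E8 */
    return 0;
}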

I recommend reading the corresponding chapter of the Unicode specification, more precisely sections 2.4 Code Points and Characters and 2.5 Encoding Forms, the latter covering the different encodings of Unicode.

Upvotes: 0

R.. GitHub STOP HELPING ICE

Reputation: 215193

UTF-8 is defined in RFC 3629, and equivalently in the Unicode standard and in ISO 10646. The first has the advantage of using a simple ABNF description of the syntax for valid byte sequences. Your function will have to replicate this, and there are no easy shortcuts when working with a 32-bit integer as input; the obvious solution is to break it back down into bytes and execute a DFA on them. There are some optimized vectorized implementations working on whole vectors, but they depend on the ability to do range checks on individual bytes within the vector, which is not easy to implement with 32-bit arithmetic.
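
For illustration only, a rough sketch of such a byte-wise check, following the RFC 3629 ranges directly rather than a table-driven DFA (note it does not accept your 0xC080 special case):

#include <stdint.h>

// Check one candidate sequence of n bytes against the ranges listed in
// RFC 3629 ("Syntax of UTF-8 Byte Sequences").
static int utf8_seq_valid(const uint8_t *b, int n)
{
    if (n == 1) return b[0] <= 0x7F;
    if (n == 2) return b[0] >= 0xC2 && b[0] <= 0xDF
                    && b[1] >= 0x80 && b[1] <= 0xBF;
    if (n == 3) {
        uint8_t lo = 0x80, hi = 0xBF;           // limits for the 2nd byte
        if      (b[0] == 0xE0) lo = 0xA0;       // reject overlongs
        else if (b[0] == 0xED) hi = 0x9F;       // reject surrogates
        else if (b[0] <  0xE0 || b[0] > 0xEF) return 0;
        return b[1] >= lo && b[1] <= hi
            && b[2] >= 0x80 && b[2] <= 0xBF;
    }
    if (n == 4) {
        uint8_t lo = 0x80, hi = 0xBF;           // limits for the 2nd byte
        if      (b[0] == 0xF0) lo = 0x90;       // reject overlongs
        else if (b[0] == 0xF4) hi = 0x8F;       // reject > U+10FFFF
        else if (b[0] <  0xF0 || b[0] > 0xF4) return 0;
        return b[1] >= lo && b[1] <= hi
            && b[2] >= 0x80 && b[2] <= 0xBF
            && b[3] >= 0x80 && b[3] <= 0xBF;
    }
    return 0;
}

// Break the packed 32-bit value from the question back into its bytes.
int chr_isvalid_bytewise(uint32_t c)
{
    uint8_t b[4];
    int n = 0;
    if (c > 0xFFFFFF) b[n++] = (uint8_t)(c >> 24);
    if (c > 0xFFFF)   b[n++] = (uint8_t)(c >> 16);
    if (c > 0xFF)     b[n++] = (uint8_t)(c >> 8);
    b[n++] = (uint8_t)c;
    return utf8_seq_valid(b, n);
}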

Upvotes: 1

JoelFan

Reputation: 38704

I'm not sure what your full requirements are, but the way to recognize a valid UTF-8 code point is to count the number of leading 1 bits of the first byte, which tells you the length of the sequence. Then each subsequent byte in the sequence must start with "10".

(If the first byte starts with a leading "0", then it's a 1-byte sequence, aka ASCII)
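
For instance, the leading-ones count can be computed with a small loop like the one below (just a sketch; as other answers point out, this check alone does not reject overlongs, surrogates, or values above U+10FFFF):

#include <stdint.h>

// Length of the sequence implied by its first byte (1-4),
// or 0 if that byte cannot start a sequence.
int utf8_seq_len(uint8_t first)
{
    int ones = 0;
    while (ones < 8 && (first & (0x80 >> ones)))   // count leading 1 bits
        ones++;
    if (ones == 0) return 1;                       // 0xxxxxxx: 1-byte (ASCII)
    if (ones >= 2 && ones <= 4) return ones;       // 110, 1110 or 11110 prefix
    return 0;                                      // 10xxxxxx or 5+ leading 1s
}

// Every subsequent byte of the sequence must match 10xxxxxx.
int utf8_is_continuation(uint8_t b)
{
    return (b & 0xC0) == 0x80;
}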

Upvotes: 0
