Reputation: 399
I am trying to write a program which takes a file as input, iterates over the file, and checks whether the file contains UTF-8 encoded characters.
However, I am unsure how to approach the problem of UTF-8 encoding. I understand the basic concept behind the encoding: a character can be stored in 1-4 bytes, where a 1-byte character is just its ASCII representation (0-127).
1 byte: 0xxxxxxx
For the remaining lengths, I believe the patterns are as follows:
2 bytes: 110xxxxx 10xxxxxx
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
However, I struggle to see how to implement this in C. I know how I would iterate over the file and do something if the predicate for UTF-8 encoding holds:
int check;
while ((check = fgetc(fp)) != EOF) {
    if (/* check is part of a UTF-8 encoded character */) {
        // do something to the code
    }
}
However, I am unsure how to actually implement the UTF-8 check in C (or any language that does not have a built-in facility for this, such as C#'s UTF8Encoding).
As a simple example, similar logic for ASCII would just have me iterating over each character (held in the check variable) and verifying whether it is within the ASCII limits:
if (check >= 0 && check <= 127) {
// do something to the code
}
Can anyone explain how I would apply similar logic to determine whether the check variable holds part of a UTF-8 encoded character instead?
Upvotes: 1
Views: 1644
Reputation: 328
if ( (ch & 0x80) == 0x0 ) {
    // 0xxxxxxx: single-byte (ASCII) character
}
else if ( (ch & 0xe0) == 0xc0 ) {
    // 110xxxxx: lead byte of a 2-byte sequence
}
else if ( (ch & 0xf0) == 0xe0 ) {
    // 1110xxxx: lead byte of a 3-byte sequence
}
else if ( (ch & 0xf8) == 0xf0 ) {
    // 11110xxx: lead byte of a 4-byte sequence
}
You want to mask the leading bits with bitwise & and compare the result against the expected prefix: the lead byte of an n-byte sequence starts with n ones followed by a zero. It helps to write out the numbers in binary and follow along.
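Here is a minimal sketch of how this might be wired into your read loop, assuming fp is the open file; the helper name check_utf8_pattern is made up for illustration. It only classifies lead bytes and confirms that the expected number of continuation bytes (10xxxxxx) follows; it does not enforce the stricter rules (shortest form, surrogate ranges) discussed in the other answer:
#include <stdio.h>

// Returns 1 if the stream from the current position contains only
// well-formed lead/continuation byte patterns, 0 otherwise.
// (Sketch only: does not reject overlong forms or surrogates.)
int check_utf8_pattern(FILE *fp) {
    int ch;
    while ((ch = fgetc(fp)) != EOF) {
        int extra;                        // continuation bytes expected
        if ((ch & 0x80) == 0x00)          extra = 0;  // 0xxxxxxx
        else if ((ch & 0xe0) == 0xc0)     extra = 1;  // 110xxxxx
        else if ((ch & 0xf0) == 0xe0)     extra = 2;  // 1110xxxx
        else if ((ch & 0xf8) == 0xf0)     extra = 3;  // 11110xxx
        else return 0;                    // 10xxxxxx is invalid as a lead byte
        while (extra-- > 0) {
            ch = fgetc(fp);
            if (ch == EOF || (ch & 0xc0) != 0x80)     // must be 10xxxxxx
                return 0;
        }
    }
    return 1;
}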
Upvotes: 1
Reputation: 215193
UTF-8 is not hard, but it is stricter than what you realize and what jpsalm's answer suggests. If you want to test that it's valid UTF-8, you need to determine that it conforms to the definition, expressed in ABNF in RFC 3629:
UTF8-octets = *( UTF8-char )
UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
UTF8-1      = %x00-7F
UTF8-2      = %xC2-DF UTF8-tail
UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
              %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
              %xF4 %x80-8F 2( UTF8-tail )
UTF8-tail   = %x80-BF
Alternatively, you can do a bunch of math checking for "non-shortest form" and other things (surrogate ranges), but that's a huge pain and highly error-prone. Almost every implementation I've seen done this way, even in major, widely used software, has been outright wrong about at least one thing. A state machine that accepts UTF-8 is easy to write and easy to verify against the formal definition. One nice, clean, readable one is described in detail at https://bjoern.hoehrmann.de/utf-8/decoder/dfa/
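For illustration, here is one way such a validator might look: a hand-rolled loop that follows the RFC 3629 grammar above byte by byte. This is my own sketch, not the table-driven DFA from the linked page, and the helper name validate_utf8 is made up for the example:
#include <stdio.h>

// Returns 1 if the remainder of the stream is valid UTF-8 per RFC 3629,
// 0 otherwise.
int validate_utf8(FILE *fp) {
    int ch;
    while ((ch = fgetc(fp)) != EOF) {
        unsigned char b = (unsigned char)ch;
        int remaining;                         // tail bytes still expected
        unsigned char lo = 0x80, hi = 0xBF;    // allowed range for the next tail byte

        if (b <= 0x7F) continue;                                  // UTF8-1
        else if (b >= 0xC2 && b <= 0xDF) remaining = 1;           // UTF8-2
        else if (b == 0xE0) { remaining = 2; lo = 0xA0; }         // UTF8-3
        else if (b >= 0xE1 && b <= 0xEC) remaining = 2;
        else if (b == 0xED) { remaining = 2; hi = 0x9F; }         // excludes surrogates
        else if (b >= 0xEE && b <= 0xEF) remaining = 2;
        else if (b == 0xF0) { remaining = 3; lo = 0x90; }         // UTF8-4
        else if (b >= 0xF1 && b <= 0xF3) remaining = 3;
        else if (b == 0xF4) { remaining = 3; hi = 0x8F; }         // caps at U+10FFFF
        else return 0;                         // C0, C1, F5-FF never appear in UTF-8

        while (remaining-- > 0) {
            ch = fgetc(fp);
            if (ch == EOF) return 0;
            b = (unsigned char)ch;
            if (b < lo || b > hi) return 0;
            lo = 0x80; hi = 0xBF;              // only the first tail byte is restricted
        }
    }
    return 1;
}
Keeping the allowed range for the first tail byte as a pair of bounds is what lets this stay a straight transcription of the grammar: each lead-byte branch corresponds to exactly one alternative in UTF8-2/3/4 above.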
Upvotes: 0