NewDev90

Reputation: 399

How can I determine if a file contains UTF-8 encoded characters?

I am trying to write a program which takes a file as input, iterates over the file, and then checks whether the file contains UTF-8 encoded characters.

However, I am unsure how to approach the problem of UTF-8 encoding. I understand the basic concept behind the encoding: a character can be stored in 1-4 bytes, where a single byte is just the ASCII representation (0-127).

1 byte: 0xxxxxxx

For the remainder I believe the pattern to be as such:

2 bytes: 110xxxxx 10xxxxxx

3 bytes: 1110xxxx 10xxxxxx 10xxxxxx

4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
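
For example, if I have understood the pattern correctly, the code point U+00E9 (é) is 11101001 in binary, so it should fit the 2-byte pattern like this:

110 00011  10 101001  ->  0xC3 0xA9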

However, I struggle to see how to implement this in C. I know how I would iterate over the file and do something if the predicate of UTF-8 encoding holds:

int check;
while ((check = fgetc(fp)) != EOF) {
    if (/* check starts a UTF-8 encoded character? */) {
        // do something to the code
    }
}

However, I am unsure how to actually implement the rules of UTF-8 encoding in C (or any language which does not have a built-in function for this, such as C#'s UTF8Encoding).

As a simple example, using similar logic for plain ASCII would just have me iterate over each character (held in the check variable) and verify whether it is within the ASCII range:

if (check >= 0 && check <= 127) {
    // do something to the code
}

Can anyone explain how I would apply similar logic when trying to determine whether the check variable holds (the start of) a UTF-8 encoded character instead?

Upvotes: 1

Views: 1644

Answers (2)

jpsalm

Reputation: 328

if ((ch & 0x80) == 0x00) {
    // ASCII byte (1-byte sequence)
}
else if ((ch & 0xe0) == 0xc0) {
    // lead byte of a 2-byte sequence
}
else if ((ch & 0xf0) == 0xe0) {
    // lead byte of a 3-byte sequence
}
else if ((ch & 0xf8) == 0xf0) {
    // lead byte of a 4-byte sequence
}

You want to bitwise-AND the top bits of the byte and compare against the patterns above: the mask keeps one more bit than the run of leading 1s you are testing for, so the comparison also confirms the 0 that terminates the run. It helps to write out the numbers in binary and follow along.
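
As a rough, untested sketch, you could combine those masks with the fgetc loop from the question and also check that the expected number of continuation bytes follow, each matching the 10xxxxxx pattern:

#include <stdio.h>

int looks_like_utf8(FILE *fp) {
  int ch;
  while ((ch = fgetc(fp)) != EOF) {
    int extra;
    if ( (ch & 0x80) == 0x0 )       extra = 0;  // ascii byte
    else if ( (ch & 0xe0) == 0xc0 ) extra = 1;  // 2 bytes
    else if ( (ch & 0xf0) == 0xe0 ) extra = 2;  // 3 bytes
    else if ( (ch & 0xf8) == 0xf0 ) extra = 3;  // 4 bytes
    else return 0;                              // not a valid lead byte

    while (extra--) {
      ch = fgetc(fp);
      if (ch == EOF || (ch & 0xc0) != 0x80)
        return 0;                               // missing or malformed continuation byte
    }
  }
  return 1;
}

Note that this only checks the bit patterns, so it will still accept overlong encodings and surrogates (see the other answer for the stricter rules).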

Upvotes: 1

R.. GitHub STOP HELPING ICE

Reputation: 215193

UTF-8 is not hard, but it is stricter than what you realize and what jpsalm's answer suggests. If you want to test that it's valid UTF-8, you need to determine that it conforms to the definition, expressed in ABNF in RFC 3629:

UTF8-octets = *( UTF8-char )
UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
UTF8-1      = %x00-7F
UTF8-2      = %xC2-DF UTF8-tail
UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
              %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
              %xF4 %x80-8F 2( UTF8-tail )
UTF8-tail   = %x80-BF

Alternatively, you can do a bunch of math checking for "non-shortest form" and other stuff (surrogate ranges), but that's a huge pain, and highly error-prone. Almost every single implementation I've ever seen done this way, even in major widely used software, has been outright wrong on at least one thing. A state machine that accepts UTF-8 is easy to write and easy to verify against the formal definition. One nice, clean, readable one is described in detail at https://bjoern.hoehrmann.de/utf-8/decoder/dfa/
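
For example, here is a rough sketch of mine that just follows the byte ranges from the grammar above while reading with fgetc (adapt the I/O to your needs):

#include <stdio.h>

// Returns 1 if everything read from fp is valid UTF-8, 0 otherwise.
// Each branch mirrors one alternative of the RFC 3629 grammar.
int is_valid_utf8(FILE *fp) {
    int c;
    while ((c = fgetc(fp)) != EOF) {
        int need;                        // continuation bytes still expected
        int lo = 0x80, hi = 0xBF;        // allowed range for the next byte

        if (c <= 0x7F)                   continue;                // UTF8-1
        else if (c >= 0xC2 && c <= 0xDF) need = 1;                // UTF8-2
        else if (c == 0xE0)              { need = 2; lo = 0xA0; } // E0 A0-BF tail
        else if (c >= 0xE1 && c <= 0xEC) need = 2;
        else if (c == 0xED)              { need = 2; hi = 0x9F; } // excludes surrogates
        else if (c >= 0xEE && c <= 0xEF) need = 2;
        else if (c == 0xF0)              { need = 3; lo = 0x90; } // F0 90-BF 2(tail)
        else if (c >= 0xF1 && c <= 0xF3) need = 3;
        else if (c == 0xF4)              { need = 3; hi = 0x8F; } // caps at U+10FFFF
        else return 0;                                            // C0, C1, F5-FF never appear

        while (need--) {
            c = fgetc(fp);
            if (c == EOF || c < lo || c > hi) return 0;
            lo = 0x80; hi = 0xBF;        // later tail bytes are plain 80-BF
        }
    }
    return 1;
}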

Upvotes: 0
