NewDev90

Reputation: 399

How can I determine if a file contains UTF-8 encoded characters?

I am trying to write a program which takes a file as input, iterates over the file, and then checks whether the file contains UTF-8 encoded characters.

However, I am unsure how to approach the problem of UTF-8 encoding. I understand the basic concept behind the encoding: a character can be stored in 1-4 bytes, where a single byte is just the ASCII representation (0-127).

1 byte: 0xxxxxxx

For the remainder I believe the pattern to be as such:

2 bytes: 110xxxxx 10xxxxxx

3 bytes: 1110xxxx 10xxxxxx 10xxxxxx

4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
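
For example, if I have understood the pattern correctly, the code point U+00E9 (é) is 11101001 in binary, so it should fit the 2-byte pattern like this:

110 00011  10 101001  ->  0xC3 0xA9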

However, I struggle to see how to implement this in C. I know how I would iterate over the file and do something if the predicate of UTF-8 encoding holds:

int check;
while ((check = fgetc(fp)) != EOF) {
    if (/* check starts a UTF-8 encoded character? */) {
        // do something to the code
    }
}

However, I am unsure how to actually implement the rules of UTF-8 encoding in C (or any language which does not have a built-in function for this, such as C#'s UTF8Encoding).

As a simple example, using similar logic for plain ASCII would just have me iterate over each character (held in the check variable) and verify whether it is within the ASCII range:

if (check >= 0 && check <= 127) {
    // do something to the code
}

Can anyone explain how I would apply similar logic when trying to determine whether the check variable holds (the start of) a UTF-8 encoded character instead?

Upvotes: 1

Views: 1644

Answers (2)

jpsalm

Reputation: 328

if ((ch & 0x80) == 0x00) {
    // ASCII byte (1-byte sequence)
}
else if ((ch & 0xe0) == 0xc0) {
    // lead byte of a 2-byte sequence
}
else if ((ch & 0xf0) == 0xe0) {
    // lead byte of a 3-byte sequence
}
else if ((ch & 0xf8) == 0xf0) {
    // lead byte of a 4-byte sequence
}

You want to bitwise-AND the top bits of the byte and compare against the patterns above: the mask keeps one more bit than the run of leading 1s you are testing for, so the comparison also confirms the 0 that terminates the run. It helps to write out the numbers in binary and follow along.
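
As a rough, untested sketch, you could combine those masks with the fgetc loop from the question and also check that the expected number of continuation bytes follow, each matching the 10xxxxxx pattern:

#include <stdio.h>

int looks_like_utf8(FILE *fp) {
  int ch;
  while ((ch = fgetc(fp)) != EOF) {
    int extra;
    if ( (ch & 0x80) == 0x0 )       extra = 0;  // ascii byte
    else if ( (ch & 0xe0) == 0xc0 ) extra = 1;  // 2 bytes
    else if ( (ch & 0xf0) == 0xe0 ) extra = 2;  // 3 bytes
    else if ( (ch & 0xf8) == 0xf0 ) extra = 3;  // 4 bytes
    else return 0;                              // not a valid lead byte

    while (extra--) {
      ch = fgetc(fp);
      if (ch == EOF || (ch & 0xc0) != 0x80)
        return 0;                               // missing or malformed continuation byte
    }
  }
  return 1;
}

Note that this only checks the bit patterns, so it will still accept overlong encodings and surrogates (see the other answer for the stricter rules).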

Upvotes: 1

R.. GitHub STOP HELPING ICE

Reputation: 215193

UTF-8 is not hard, but it is stricter than what you realize and what jpsalm's answer suggests. If you want to test that it's valid UTF-8, you need to determine that it conforms to the definition, expressed in ABNF in RFC 3629:

UTF8-octets = *( UTF8-char )
UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
UTF8-1      = %x00-7F
UTF8-2      = %xC2-DF UTF8-tail
UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
              %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
              %xF4 %x80-8F 2( UTF8-tail )
UTF8-tail   = %x80-BF

Alternatively, you can do a bunch of math checking for "non-shortest form" and other stuff (surrogate ranges), but that's a huge pain, and highly error-prone. Almost every single implementation I've ever seen done this way, even in major widely used software, has been outright wrong on at least one thing. A state machine that accepts UTF-8 is easy to write and easy to verify against the formal definition. One nice, clean, readable one is described in detail at https://bjoern.hoehrmann.de/utf-8/decoder/dfa/
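
For example, here is a rough sketch of mine that just follows the byte ranges from the grammar above while reading with fgetc (adapt the I/O to your needs):

#include <stdio.h>

// Returns 1 if everything read from fp is valid UTF-8, 0 otherwise.
// Each branch mirrors one alternative of the RFC 3629 grammar.
int is_valid_utf8(FILE *fp) {
    int c;
    while ((c = fgetc(fp)) != EOF) {
        int need;                        // continuation bytes still expected
        int lo = 0x80, hi = 0xBF;        // allowed range for the next byte

        if (c <= 0x7F)                   continue;                // UTF8-1
        else if (c >= 0xC2 && c <= 0xDF) need = 1;                // UTF8-2
        else if (c == 0xE0)              { need = 2; lo = 0xA0; } // E0 A0-BF tail
        else if (c >= 0xE1 && c <= 0xEC) need = 2;
        else if (c == 0xED)              { need = 2; hi = 0x9F; } // excludes surrogates
        else if (c >= 0xEE && c <= 0xEF) need = 2;
        else if (c == 0xF0)              { need = 3; lo = 0x90; } // F0 90-BF 2(tail)
        else if (c >= 0xF1 && c <= 0xF3) need = 3;
        else if (c == 0xF4)              { need = 3; hi = 0x8F; } // caps at U+10FFFF
        else return 0;                                            // C0, C1, F5-FF never appear

        while (need--) {
            c = fgetc(fp);
            if (c == EOF || c < lo || c > hi) return 0;
            lo = 0x80; hi = 0xBF;        // later tail bytes are plain 80-BF
        }
    }
    return 1;
}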

Upvotes: 0
