Find non-ascii characters from a UTF-8 string

Question

I need to find the non-ASCII characters in a UTF-8 string.

My understanding: UTF-8 is a superset of ASCII in which the values 0-127 are the ASCII characters. So if a character's value in a UTF-8 string is not between 0 and 127, it is not an ASCII character, right? Please correct me if I'm wrong here.

Based on this understanding, I have written the following code in C:

Note: I'm using gcc on Ubuntu to compile and run the C code.

The UTF-8 string is x√ab c

long i;
    char arr[] = "x√ab c";
    printf("length : %lu\n", sizeof(arr));
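    for (i = 0; i < sizeof(arr); i++) {
        if (arr[i] >= 0 && arr[i] <= 127)
            printf("Ascii character %c\n", arr[i]);
        else
            printf("Not ascii character %c\n", arr[i]);
    }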



Which prints the output like:

length : 9 
Ascii character x
Not ascii character 
Not ascii character �
Not ascii character �
Ascii character a
Ascii character b
Ascii character  
Ascii character c
Ascii character 


To the naked eye, the length of x√ab c seems to be 6, but the code reports 9.
The correct answer for x√ab c is 1, i.e. it has only one non-ASCII character, but the output above reports "Not ascii character" three times.

How can I correctly find the non-ASCII characters in a UTF-8 string?

Please guide me on the subject.

Joachim Sauer · Accepted Answer

What C calls a char is actually a byte. A UTF-8 character can be made up of several bytes.
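To see where the length of 9 in the question comes from, it can help to look at the raw bytes. The following is only a sketch: bytes_to_hex is a hypothetical helper name (not a library function), and √ (U+221A) is written with hex escapes so the example does not depend on the encoding of the source file.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the bytes of s as space-separated hex
   into out. Returns the number of bytes in s, excluding the NUL. */
size_t bytes_to_hex(const char *s, char *out, size_t out_size)
{
    size_t n = strlen(s);
    size_t pos = 0;

    if (out_size > 0)
        out[0] = '\0';
    /* pos + 3 < out_size leaves room for " XX" plus the NUL. */
    for (size_t i = 0; i < n && pos + 3 < out_size; i++) {
        /* Cast to unsigned char: plain char may be signed, so bytes
           >= 0x80 would otherwise be treated as negative values. */
        pos += (size_t)snprintf(out + pos, out_size - pos, "%s%02X",
                                i ? " " : "",
                                (unsigned)(unsigned char)s[i]);
    }
    return n;
}
```

Calling bytes_to_hex("\xE2\x88\x9A", buf, sizeof buf) fills buf with "E2 88 9A" and returns 3: one character, three bytes. Together with x, a, b, the space, c and the terminating NUL, that gives the 9 reported by sizeof.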

In fact only the ASCII characters are represented by a single byte in UTF-8 (which is why all valid ASCII-encoded text is also effectively UTF-8 encoded).

So to count the number of UTF-8 characters you have to do a partial decoding: count the bytes that begin a character, i.e. the single-byte codes and the start bytes.

See the Wikipedia article on UTF-8 to find out how they are encoded.

Basically there are 3 categories:

single-byte codes: 0b0xxxxxxx
start bytes: 0b110xxxxx, 0b1110xxxx, 0b11110xxx
continuation bytes: 0b10xxxxxx
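The three categories above can be told apart with a few bit masks. A minimal sketch, assuming the enum and function names below (they are made up for illustration); it only looks at the bit pattern and does not validate the input, so bytes such as 0xC0 or 0xF5, which match a start pattern but never occur in well-formed UTF-8, are still reported as start bytes.

```c
/* Hypothetical names, for illustration only. */
enum utf8_byte_kind {
    UTF8_SINGLE,        /* 0xxxxxxx: ASCII */
    UTF8_START,         /* 110xxxxx, 1110xxxx, 11110xxx */
    UTF8_CONTINUATION,  /* 10xxxxxx */
    UTF8_INVALID        /* 11111xxx: not used by UTF-8 */
};

/* Classifies one byte by its leading bits. Takes unsigned char so
   that bytes >= 0x80 are not seen as negative values. */
enum utf8_byte_kind utf8_classify(unsigned char b)
{
    if ((b & 0x80) == 0x00) return UTF8_SINGLE;
    if ((b & 0xC0) == 0x80) return UTF8_CONTINUATION;
    if ((b & 0xE0) == 0xC0) return UTF8_START;  /* starts a 2-byte sequence */
    if ((b & 0xF0) == 0xE0) return UTF8_START;  /* starts a 3-byte sequence */
    if ((b & 0xF8) == 0xF0) return UTF8_START;  /* starts a 4-byte sequence */
    return UTF8_INVALID;
}
```

For the √ in the question (bytes E2 88 9A), the first byte is classified as a start byte and the other two as continuation bytes.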

To count the number of Unicode code points, simply count all bytes that are not continuation bytes.
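Applied to the question, that rule might look like the sketch below. Both function names are hypothetical, and the code assumes the input is valid, NUL-terminated UTF-8; it does no error checking. Counting non-ASCII characters works the same way: every multi-byte sequence has exactly one start byte, so counting start bytes counts the non-ASCII code points.

```c
#include <stddef.h>

/* Counts code points: every byte that is NOT a continuation byte
   (10xxxxxx) begins a new code point. */
size_t utf8_count_code_points(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Counts non-ASCII code points: only start bytes have their two
   top bits set (11xxxxxx), and each multi-byte sequence has one. */
size_t utf8_count_non_ascii(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) == 0xC0)
            count++;
    }
    return count;
}
```

For the string from the question, written as "x\xE2\x88\x9A" "ab c" to keep the hex escape from swallowing the following letters, utf8_count_code_points returns 6 and utf8_count_non_ascii returns 1.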

However, Unicode code points don't always have a 1-to-1 correspondence to "characters" (depending on your exact definition of character).
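One common case where the two differ is combining marks. As a small illustration (the names below are made up for this example), "é" can be stored either as the single code point U+00E9 or as the letter e followed by the combining acute accent U+0301. Both usually render as one character, but counting non-continuation bytes gives different results.

```c
#include <stddef.h>

/* "é" as one precomposed code point, U+00E9. */
static const char e_acute_precomposed[] = "\xC3\xA9";
/* "é" as 'e' followed by U+0301 COMBINING ACUTE ACCENT. */
static const char e_acute_decomposed[] = "e\xCC\x81";

/* Counts code points by counting bytes that are not continuation
   bytes (10xxxxxx). Assumes valid, NUL-terminated UTF-8. */
size_t count_non_continuation_bytes(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

Here count_non_continuation_bytes(e_acute_precomposed) returns 1 and count_non_continuation_bytes(e_acute_decomposed) returns 2, even though a reader sees a single "é" in both cases. Handling that properly needs grapheme cluster rules, which are beyond a simple byte count.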
