navyad

Reputation: 3860

Find non-ascii characters from a UTF-8 string

I need to find the non-ASCII characters from a UTF-8 string.

My understanding: UTF-8 is a superset of ASCII in which the values 0-127 are the ASCII characters. So if a character's value in a UTF-8 string is not between 0 and 127, it is not an ASCII character, right? Please correct me if I'm wrong here.

Based on that understanding, I have written the following code in C:

Note: I'm compiling the C code with gcc on Ubuntu.

The UTF-8 string is x√ab c

long i;
char arr[] = "x√ab c";
printf("length : %lu \n", sizeof(arr));
for (i = 0; i < sizeof(arr); i++) {
    char ch = arr[i];
    if (isascii(ch))
        printf("Ascii character %c\n", ch);
    else
        printf("Not ascii character %c\n", ch);
}

Which prints the output like:

length : 9 
Ascii character x
Not ascii character 
Not ascii character �
Not ascii character �
Ascii character a
Ascii character b
Ascii character  
Ascii character c
Ascii character 

To the naked eye, the length of x√ab c seems to be 6, but the code reports 9. The correct answer for x√ab c is 1, i.e. it has only one non-ASCII character, but the output above prints "Not ascii character" 3 times.

How can I correctly find the non-ASCII characters in a UTF-8 string?

Any guidance on the subject is appreciated.

Upvotes: 3

Views: 11150

Answers (2)

Rohit Jose

Reputation: 188

When UTF-8 text is stored in a char array, the first byte of each multi-byte character encodes how many bytes are used to represent that character: the number of consecutive 1 bits, counted from the MSB of the first byte, equals the total number of bytes occupied by the non-ASCII character. For '√' the binary form is 11100010 10001000 10011010. Counting the leading 1s in the first byte gives 3, so the character occupies 3 bytes. Something like the code below would work for this:

int get_count(char non_ascii_char){
    /*
       Returns the number of bytes occupied by a UTF-8 character.
       Takes the first byte of a non-ASCII character as input and
       returns the sequence length to the calling function.
       Note: an ASCII byte yields 0 and a continuation byte (10xxxxxx) yields 1.
    */
    /* work on an unsigned copy so a signed char is never shifted as a negative value */
    unsigned char byte = (unsigned char)non_ascii_char;
    int bit_counter = 7, count = 0;
    /*
       bit_counter - traverses the bits of the byte, starting from the MSB
       count - stores the number of bytes occupied by the character
    */

    for (; bit_counter >= 0; bit_counter--) {
        if ((byte >> bit_counter) & 1) {
            count++; // counts the consecutive leading 1s in the byte
        }
        else {
            break; // stops at the first 0
        }
    }

    return count; // returns the count to the calling function
}
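
As a rough sketch of how a lead-byte length could be used to answer the question, the loop below walks the string one character at a time and counts the multi-byte ones. The names utf8_seq_len and count_non_ascii_chars are hypothetical, and the code assumes mostly well-formed UTF-8: it does not validate continuation bytes or reject overlong encodings.

```c
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence that starts with `lead`,
   or 0 if `lead` cannot start a sequence (continuation or unused byte).
   Hypothetical helper, not part of any standard library. */
static int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;            /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}

/* Counts multi-byte (non-ASCII) characters in a NUL-terminated string.
   Assumption: input is mostly well-formed; stray bytes are skipped, not counted. */
static size_t count_non_ascii_chars(const char *s)
{
    size_t count = 0;
    while (*s != '\0') {
        int len = utf8_seq_len((unsigned char)*s);
        if (len == 0) {       /* stray continuation or unused byte: skip it */
            s++;
            continue;
        }
        if (len > 1)
            count++;          /* one non-ASCII character, however many bytes */
        /* advance past the whole sequence, stopping early at the terminator */
        while (len-- > 0 && *s != '\0')
            s++;
    }
    return count;
}
```

For the string from the question, written with byte escapes as "x\xE2\x88\x9A" "ab c" so the result does not depend on the source file encoding, count_non_ascii_chars returns 1.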

Upvotes: 3

Joachim Sauer

Reputation: 308061

What C calls a char is actually a byte. A UTF-8 character can be made up of several bytes.

In fact only the ASCII characters are represented by a single byte in UTF-8 (which is why all valid ASCII-encoded text is also effectively UTF-8 encoded).

So to count the number of UTF-8 characters you have to do a partial decoding: count the bytes that begin a UTF-8 sequence, i.e. the single-byte codes and the start bytes.

See the Wikipedia article on UTF-8 to find out how they are encoded.

Basically there are 3 categories:

  • single-byte codes: 0b0xxxxxxx
  • start bytes: 0b110xxxxx, 0b1110xxxx, 0b11110xxx
  • continuation bytes: 0b10xxxxxx
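
The three categories can be told apart with a few bit masks. The following is a minimal sketch; the enum and function names are made up for illustration, and a fourth value is added as an assumption to cover bytes of the form 11111xxx, which UTF-8 does not use.

```c
/* Hypothetical names; the first three kinds follow the categories listed above. */
enum utf8_byte_kind {
    UTF8_SINGLE,        /* 0xxxxxxx */
    UTF8_START,         /* 110xxxxx, 1110xxxx, 11110xxx */
    UTF8_CONTINUATION,  /* 10xxxxxx */
    UTF8_INVALID        /* 11111xxx: never appears in UTF-8 */
};

/* Classifies a single byte by its leading bits only; it does not check
   whether the byte is valid in the context of its neighbours. */
static enum utf8_byte_kind utf8_classify(unsigned char b)
{
    if ((b & 0x80) == 0x00) return UTF8_SINGLE;
    if ((b & 0xC0) == 0x80) return UTF8_CONTINUATION;
    if ((b & 0xE0) == 0xC0 || (b & 0xF0) == 0xE0 || (b & 0xF8) == 0xF0)
        return UTF8_START;
    return UTF8_INVALID;
}
```

For the three bytes of '√' (0xE2 0x88 0x9A) this gives one start byte followed by two continuation bytes, which is why the loop in the question printed "Not ascii character" three times for a single character.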

To count the number of Unicode code points, simply count all bytes that are not continuation bytes.
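
Counting the bytes that are not continuation bytes might look like the sketch below. The function names are hypothetical, and the input is assumed to be valid, NUL-terminated UTF-8. The second function applies the same idea to the original question by counting only start bytes, which gives the number of non-ASCII code points.

```c
#include <stddef.h>

/* Counts code points: every byte except continuation bytes (10xxxxxx).
   Assumes `s` is valid, NUL-terminated UTF-8. */
static size_t utf8_count_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        unsigned char b = (unsigned char)*s;
        if ((b & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Counts non-ASCII code points: only start bytes (11xxxxxx) are counted,
   so each multi-byte character contributes exactly once. */
static size_t utf8_count_non_ascii(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        unsigned char b = (unsigned char)*s;
        if ((b & 0xC0) == 0xC0)
            n++;
    }
    return n;
}
```

With the question's string written as "x\xE2\x88\x9A" "ab c" (the escapes spell out the three bytes of '√'), utf8_count_codepoints returns 6 and utf8_count_non_ascii returns 1.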

However, Unicode code points don't always have a 1-to-1 correspondence to "characters" (depending on your exact definition of character).

Upvotes: 6
