xiaohan2012

Reputation: 10332

Dealing with Chinese characters in C string manipulation

It is known that in C, a string is represented by an array of chars.

On most platforms, a char takes one byte, i.e. eight bits, so a string consists of an array of single bytes.

Because characters like Chinese and Japanese take up more than 8 bits, I am getting a little confused about how this works.

For example, I have tested that I can define an array of Chinese characters the same way as an array of English letters, using syntax like char array[100]. So my question is:

Is there a mechanism that bridges the gap between ordinary 8-bit characters and wider-than-8-bit characters, so that both can be treated the same way, as in the example above?

Upvotes: 4

Views: 1836

Answers (2)

user529758

Reputation:

I'd suggest using the UTF-8 string encoding, as it makes it possible to use ordinary ASCII characters (byte values <= 127) exactly as usual, while two-, three-, or four-byte characters are recognizable by their lead and continuation bytes (values >= 128). You can also use libiconv for related conversion problems: http://www.gnu.org/software/libiconv/

Upvotes: 0

Ofir

Reputation: 8362

Yes, using multi-byte character encodings. This is a rather wide subject, but start with the following:

Upvotes: 3
