mats

Reputation: 1828

How to deal with (presumably) UTF-8 strings in C programs on OSX

Hopefully the question title describes my issue well enough.

Platform: OSX 10.8, llvm with clang++ compiler

I have got a directory with filenames in Japanese or Cyrillic characters. Those filenames are displayed correctly (e.g. via ls) in iTerm2 with en_EN.UTF-8 locale and Monaco 10 font (not sure if locale/font make a difference, but it seems it should). A vanilla xterm without UTF-8 support, however, prints scrambled symbols or '?' characters for non-ASCII characters.

Here is the actual question:

In a C++ program, I use readdir() from dirent.h to list the contents of a directory containing filenames in Japanese or Cyrillic characters. Printing the d_name member of the struct dirent returned by readdir() displays the correct characters in the Xcode terminal. That is, Japanese kanji, for example, really are displayed as such. The same is true when executing the program from iTerm2. Again, the non-UTF-8 xterm shows scrambled characters.
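A minimal sketch of the kind of test program described above, for reference. The helper names hex_bytes, list_names, and dump_dir are made up for illustration. It assumes a POSIX dirent.h and prints each d_name unchanged next to a hex dump of its bytes, so the raw bytes can be compared with what a given terminal renders.

```cpp
#include <dirent.h>
#include <cstdio>
#include <string>
#include <vector>

// Render each byte of a name as two hex digits separated by spaces.
std::string hex_bytes(const std::string& s) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if (i > 0) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}

// Collect the raw d_name of every entry in `path`.
// Returns an empty vector if the directory cannot be opened.
std::vector<std::string> list_names(const char* path) {
    std::vector<std::string> names;
    DIR* dir = opendir(path);
    if (dir == NULL) return names;
    while (struct dirent* entry = readdir(dir)) {
        names.push_back(entry->d_name);
    }
    closedir(dir);
    return names;
}

// Print each name exactly as readdir() returned it, then its bytes in hex.
void dump_dir(const char* path) {
    std::vector<std::string> names = list_names(path);
    for (std::size_t i = 0; i < names.size(); ++i) {
        std::printf("%s  [%s]\n", names[i].c_str(), hex_bytes(names[i]).c_str());
    }
}
```

Calling dump_dir(".") from main() in both terminals should show identical hex columns, with only the rendering of the name column differing.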

Note: I am not interested in possible 3rd party alternatives to dirent.h. I wrote the program solely to shed some light on how OSX deals with locale and character encoding.

Upvotes: 4

Views: 1509

Answers (2)

Graham Borland

Reputation: 60691

UTF-8 is designed to be backwards-compatible with ASCII from the point of view of legacy string-handling code. This includes strcpy() and friends.

So yes, in your code it's generally safe to handle these strings as you would any other string*; it's only at display time that the clever stuff happens.

* as long as you're not meddling with individual characters in the string.
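To illustrate the footnote, here is a small sketch (utf8_code_points is a hypothetical helper, and it assumes the input is valid UTF-8). strlen() and strcpy() work on bytes and pass the string through untouched, but the byte count is not the character count.

```cpp
#include <cstddef>
#include <cstring>

// Count code points in a NUL-terminated string, assuming valid UTF-8:
// every byte that is not a continuation byte (bit pattern 10xxxxxx)
// starts a new code point.
std::size_t utf8_code_points(const char* s) {
    std::size_t count = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

For the three kanji "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E", std::strlen() reports 9 bytes while utf8_code_points() reports 3, so indexing by byte position does not address individual characters.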

Upvotes: 1

yiding

Reputation: 3592

A valid UTF-8 string contains no zero bytes apart from the terminator (only the NUL character U+0000 itself encodes to a zero byte), so byte-oriented string operations work on UTF-8 encoded strings. You probably do not want to take substrings at arbitrary byte offsets or modify individual bytes, though, since many characters are encoded as multiple bytes.
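If a substring is needed anyway, one way to avoid cutting a character in half is to back up to the nearest character boundary. This is only a sketch: utf8_truncate is a hypothetical helper, and it assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Return at most `max_bytes` bytes of `s` without splitting a multi-byte
// character. Assumes `s` is valid UTF-8.
std::string utf8_truncate(const std::string& s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) return s;
    std::size_t end = max_bytes;
    // s[end] is the first byte that would be dropped. If it is a
    // continuation byte (10xxxxxx), the cut falls inside a character,
    // so move the cut back to that character's lead byte.
    while (end > 0 && (static_cast<unsigned char>(s[end]) & 0xC0) == 0x80) {
        --end;
    }
    return s.substr(0, end);
}
```

With a string of three 3-byte characters, asking for 4 bytes yields the first character only (3 bytes), whereas a plain substr(0, 4) would end in a stray lead byte.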

Most of the APIs that take a char* neither know nor care about the encoding, so they should be safe to use.

setlocale will affect certain operations, mostly related to dealing with character types, ordering, and formatting.
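As a rough illustration of the character-type side, the multibyte conversion functions follow LC_CTYPE. The two helpers below are hypothetical names. The sketch also relies on mbstowcs() accepting a null destination to return only the required length, which POSIX systems support but ISO C does not guarantee.

```cpp
#include <clocale>
#include <cstddef>
#include <cstdlib>

// Adopt the locale named by the environment (LANG, LC_ALL, ...).
// Returns false if the C library rejects that locale.
bool use_environment_locale() {
    return std::setlocale(LC_CTYPE, "") != NULL;
}

// Ask the C library how many characters `s` holds under the current
// LC_CTYPE locale. Returns -1 if the bytes are not valid in that locale.
long locale_char_count(const char* s) {
    // Null destination: only the length is computed (POSIX behaviour).
    std::size_t n = std::mbstowcs(NULL, s, 0);
    if (n == static_cast<std::size_t>(-1)) return -1;
    return static_cast<long>(n);
}
```

After use_environment_locale() succeeds with a UTF-8 locale, locale_char_count() should count characters rather than bytes for a multi-byte filename. In the default "C" locale the result for non-ASCII bytes varies by platform.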

When you print the string, it goes out as a sequence of bytes. The terminal emulator interprets those bytes as UTF-8 and picks the correct characters. An xterm running without UTF-8 support cannot interpret them correctly, so it does not display the proper characters.

Upvotes: 1
