Reputation: 1828

How to deal with (presumable) UTF-8 strings in C programs on OSX

Hopefully the question title describes my issue well enough.

Platform: OSX 10.8, llvm with clang++ compiler

I have got a directory with filenames in Japanese or Cyrillic characters. Those filenames are displayed correctly (e.g. via ls) in iTerm2 with en_EN.UTF-8 locale and Monaco 10 font (not sure if locale/font make a difference, but it seems it should). A vanilla xterm without UTF-8 support, however, prints scrambled symbols or '?' characters for non-ASCII characters.

Here is the actual question:

In a C++ program, I use readdir() from dirent.h to list the contents of a directory containing filenames in Japanese or Cyrillic characters. Printing the d_name member of the struct dirent returned by readdir() displays the correct characters in the Xcode terminal. That is, Japanese kanji, for example, really are displayed as such. The same is true when executing the program from iTerm2. Again, the non-UTF-8 xterm shows scrambled characters.

Since the byte size of Japanese filenames does not equal the number of characters displayed, I boldly assume that the dirent.h functions work with UTF-8 strings. Is it possible that all of the OSX C library works that way?
Therefore, is it safe to, e.g., alter the d_name member of struct dirent, or strcpy it, and create a new file using that altered string? Is it possible to step into some trap that leads to '?????' filenames being written instead of kanji?
Would setting a different locale, e.g. "C", mess things up? (It does not seem that way when using setlocale(LC_ALL,"C").)

Note: I am not interested in possible 3rd party alternatives to dirent.h. I wrote the program solely to shed some light on how OSX deals with locale and character encoding.

Upvotes: 4

Answers (2)

Graham Borland

Reputation: 60691

UTF-8 is designed to be backwards-compatible with ASCII from the point of view of legacy string-handling code. This includes strcpy() and friends.

So yes, in your code it's generally safe to handle these strings as you would any other string*; it's only at display time that the clever stuff happens.

* As long as you're not meddling with individual characters in the string.
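As a minimal sketch of what "handle them as any other string" can look like: the helper below appends an ASCII suffix to a name without looking inside it. The function name make_backup_name and the ".bak" suffix are made up for illustration; the only assumption is that the input is a null-terminated byte string such as d_name.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes "<name>.bak" into dst.
   The name is treated as opaque bytes, so multi-byte UTF-8 sequences
   are copied unchanged. Returns 0 on success, -1 if dst is too small. */
int make_backup_name(char *dst, size_t dst_size, const char *name)
{
    int n = snprintf(dst, dst_size, "%s.bak", name);
    return (n >= 0 && (size_t)n < dst_size) ? 0 : -1;
}
```

Because nothing here indexes into the name, it makes no difference whether a character occupies one byte or three. The resulting buffer could then be passed to fopen() or rename() like any other path.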

Upvotes: 1

yiding

Reputation: 3592

UTF-8 never produces a zero byte when encoding any character other than NUL (U+0000) itself, so the usual null-terminated string operations work on UTF-8 encoded strings. You probably do not want to take substrings at arbitrary byte offsets or modify individual bytes, though, since some characters are encoded as multiple bytes.
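To make the multi-byte point concrete, here is a rough sketch that counts code points and finds a cut position that does not split a sequence. Both function names are invented for this example, and both assume the input is already valid UTF-8; neither validates it.

```c
#include <stddef.h>
#include <string.h>

/* Counts code points by skipping continuation bytes (bit pattern 10xxxxxx).
   Assumes s is valid UTF-8. */
size_t utf8_codepoints(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

/* Returns the largest prefix length, at most max_bytes, that does not
   end in the middle of a multi-byte sequence. */
size_t utf8_safe_prefix(const char *s, size_t max_bytes)
{
    size_t len = strlen(s);
    if (len <= max_bytes)
        return len;

    size_t cut = max_bytes;
    /* Back up while the first excluded byte is a continuation byte. */
    while (cut > 0 && ((unsigned char)s[cut] & 0xC0) == 0x80)
        --cut;
    return cut;
}
```

Note that a code point is not always what a reader perceives as one character (combining marks count separately), so this is only a byte-boundary safeguard, not a full text-segmentation routine.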

Most of the APIs that handle char* are not aware of the encoding and do not care about it, so they should be safe to use.

setlocale will affect certain operations, mostly related to dealing with character types, ordering, and formatting.
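A small sketch of that difference, using two illustrative helpers (the names are not from any library): strlen() only searches for the terminating zero byte, while mbsrtowcs() decodes the bytes according to the current LC_CTYPE setting, so its result can change after a setlocale() call.

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Byte length: independent of the active locale. */
size_t byte_length(const char *s)
{
    return strlen(s);
}

/* Character count as seen by the active locale. With a NULL destination,
   mbsrtowcs() only counts; it returns (size_t)-1 if the bytes are not
   valid in the locale's encoding. */
size_t locale_char_count(const char *s)
{
    mbstate_t state;
    const char *p = s;
    memset(&state, 0, sizeof state);
    return mbsrtowcs(NULL, &p, 0, &state);
}
```

Calling setlocale(LC_ALL, "C") before these leaves byte_length() unchanged. What locale_char_count() returns for non-ASCII bytes in the "C" locale differs between C libraries, so it is worth trying on the target system rather than assuming a result.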

When you print the string, it goes out as a sequence of bytes. The terminal emulator interprets those bytes as UTF-8 and picks the correct characters. An xterm that is not running in UTF-8 mode cannot interpret them correctly, so it does not display the proper characters.
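One way to check that the program itself is not altering anything is to look at the raw bytes instead of the rendered glyphs. The helper below is a hypothetical sketch (hex_bytes is not a standard function) that formats a string as space-separated hex values.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: writes the bytes of s into out as lowercase hex,
   separated by spaces. Stops early if out is too small.
   Returns the number of bytes that were fully formatted. */
size_t hex_bytes(const char *s, char *out, size_t out_size)
{
    size_t n = 0, used = 0;

    if (out_size == 0)
        return 0;
    out[0] = '\0';

    for (; s[n] != '\0'; ++n) {
        int w = snprintf(out + used, out_size - used,
                         n ? " %02x" : "%02x",
                         (unsigned)(unsigned char)s[n]);
        if (w < 0 || (size_t)w >= out_size - used) {
            out[used] = '\0';   /* drop the partially written item */
            break;
        }
        used += (size_t)w;
    }
    return n;
}
```

If the hex output for a d_name is identical in iTerm2 and in xterm, the bytes are the same in both cases and only the terminal's interpretation differs.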

Upvotes: 1
