Compatibility of printf with utf-8 encoded strings

Question

I'm trying to format some UTF-8 encoded strings in C code (char *) using the printf function. I need to specify a length in the format. Everything works when the argument string contains no multibyte characters, but the result seems to be incorrect when it does.

My glibc is rather old (2.17), so I also tried some online compilers, and the result is the same.

#include <stdio.h>
#include <locale.h>

int main(void)
{
    setlocale( LC_CTYPE, "en_US.UTF-8" );
    setlocale( LC_COLLATE, "en_US.UTF-8" );

    printf( "'%-4.4s'\n",   "elephant" );
    printf( "'%-4.4s'\n",   "éléphant" );
    printf( "'%-20.20s'\n", "éléphant" );

    return 0;
}

The result of execution is:

'elep'
'él�'
'éléphant          '

The first line is correct (4 chars in the output).

The second line is obviously wrong (at least from a human point of view).

The last line is also wrong: only 18 Unicode chars are written instead of 20.

It seems that the printf function counts chars before UTF-8 decoding (counting bytes instead of Unicode chars).

Is that a bug in glibc or a well-documented limitation of printf?

rici · Accepted Answer

It's true that printf counts bytes, not multibyte characters. If it's a bug, the bug is in the C standard, not in glibc (the standard library implementation usually used in conjunction with gcc).
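
To make the byte counting concrete, here is a minimal sketch of one possible workaround: translate the desired number of characters into a number of bytes yourself, and pass that as the precision. The helper names (utf8_prefix_bytes, utf8_length, utf8_format_left) are made up for this illustration, and the code assumes the input is valid UTF-8, where continuation bytes have the bit pattern 10xxxxxx. It counts code points, not display columns.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Number of bytes occupied by the first max_chars code points of s.
   Assumes s is valid UTF-8 (hypothetical helper, not a libc function). */
static size_t utf8_prefix_bytes(const char *s, size_t max_chars)
{
    size_t bytes = 0, chars = 0;
    while (s[bytes] != '\0') {
        /* Any byte that is not a continuation byte starts a new code point. */
        if (((unsigned char)s[bytes] & 0xC0) != 0x80) {
            if (chars == max_chars)
                break;
            chars++;
        }
        bytes++;
    }
    return bytes;
}

/* Number of code points in s (again assuming valid UTF-8). */
static size_t utf8_length(const char *s)
{
    size_t chars = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            chars++;
    return chars;
}

/* Writes s into out, left-justified, truncated and padded with spaces to
   exactly `width` code points. Returns what snprintf returns. */
int utf8_format_left(char *out, size_t out_size, const char *s, size_t width)
{
    size_t bytes = utf8_prefix_bytes(s, width);  /* precision, in bytes */
    size_t chars = utf8_length(s);
    size_t pad = chars < width ? width - chars : 0;

    /* %.*s truncates by bytes; the padding is produced by a separate
       %*s applied to an empty string. */
    return snprintf(out, out_size, "%.*s%*s", (int)bytes, s, (int)pad, "");
}
```

With this sketch, formatting "éléphant" to a width of 4 copies 6 bytes (four code points), and formatting it to a width of 20 yields 10 bytes of text followed by 12 spaces.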

In fairness, counting characters wouldn't help you align Unicode output either, because Unicode characters are not all the same display width, even in fixed-width fonts. (Many code points have width 0, for example.)

I'm not going to attempt to argue that this behaviour is "well-documented". Standard C's locale facilities have never been particularly adequate to the task, imho, and they have never been particularly well documented, in part because the underlying model attempts to encompass so many possible encodings without ever grounding itself in a concrete example that it is almost impossible to explain. (...Long rant deleted...)

You can use the wchar.h formatted output functions, which count in wide characters. (Which still isn't going to give you correct output alignment but it will count precision the way you expect.)
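
As a rough sketch of what that looks like, the function below uses swprintf with the %ls conversion, where both field width and precision are counted in wide characters. The function name format_wide_left and the quoting in the format string are only for illustration. It assumes the string is already available as wchar_t; converting a UTF-8 char * first (for example with mbstowcs) depends on the current locale and is not shown.

```c
#include <stddef.h>
#include <wchar.h>

/* Formats s into out between single quotes, left-justified, truncated
   and padded to `width` wide characters. Returns the number of wide
   characters written, or a negative value on error.
   (Hypothetical helper name, for illustration only.) */
int format_wide_left(wchar_t *out, size_t out_len,
                     const wchar_t *s, int width)
{
    /* In the wide-character functions, the width and precision of %ls
       count wide characters rather than bytes. */
    return swprintf(out, out_len, L"'%-*.*ls'", width, width, s);
}
```

Calling it with L"\u00e9l\u00e9phant" and a width of 4 keeps the four characters "élép"; with a width of 20 it pads the eight characters with twelve spaces. To print directly, the same format can be passed to wprintf, keeping in mind that a stream should not mix byte-oriented and wide-oriented output.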
