Length of a multibyte sequence in bytes, (unicode) code points, characters and cursor positions

Question

I am not sure if my assumptions are correct, but I feel that all four kinds of length of a multibyte sequence can be different. To illustrate:

Say the multibyte encoding is UTF-8, and we have the string "\xc3\xb8 \xe2\x86\x82 e\xcc\x88", the UTF-8 encoding of "\u00f8 \u2182 e\u0308", which displays as "ø ↂ ë".

This string has a length of:

1.) 10 bytes
2.) 6 unicode code-points
3.) 5 characters
4.) 6 screen positions (with a monospaced font) (ↂ takes 2 positions)

1.) is returned by strlen and 2.) can be determined with the functions.

But is there a portable way of determining 3.) and 4.)? I am not sure whether ↂ taking two cursor positions is defined font-independently for that code point or is a property of the font in use. I feel that “monospaced font” and “some characters take more than one space” are somewhat contradictory. At least in Monospace, this character does cover two cursor positions. The Unicode chart U2150 doesn't say anything about cursor positions.

Lastly, is the number of positions negative for any character (I mean, a character putting the cursor position to the left in a left-to-right script or vice versa)?

rici · Accepted Answer

The POSIX interface wcwidth can be used to find the number of "cursor positions" of a wchar_t. In order to get the wchar_t values (one at a time), you can use the C99 standard library function mbtowc, which extracts a single multibyte character from a string and returns the number of bytes consumed. (Repeatedly calling mbtowc on a string and advancing the string pointer each time will tell you how many multibyte characters are present in the string, at least if the multibyte encoding is UTF-8.)

The combination of wcwidth and mbtowc can more or less tell you how many glyphs you have in the string (your question #3). A wchar_t for which wcwidth returns 0 is either a zero-width format control or a combining character and a wchar_t for which wcwidth returns -1 is either a non-character or a control character (like ). Either way, it can be ignored, so the glyph count is effectively the count of wchar_t whose width is >0.

That makes it clear that the four questions have different answers:

1.) number of bytes.
2.) number of multibyte codepoints.
3.) number of multibyte codepoints whose wcwidth is greater than 0.
4.) sum of the wcwidth of the multibyte codepoints whose wcwidth is greater than 0.
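
As a rough illustration of the four counts above, here is a minimal sketch that walks a string with mbtowc and asks wcwidth about each decoded wchar_t. The names struct mb_lengths and mb_measure are made up for this example and are not part of any library, and the _XOPEN_SOURCE define is an assumption about what glibc needs in order to declare wcwidth.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* assumption: asks glibc's <wchar.h> to declare wcwidth() */
#endif
#include <stdlib.h>   /* mbtowc */
#include <string.h>   /* strlen */
#include <wchar.h>    /* wchar_t, wcwidth */

/* Hypothetical container for the four measures; not a standard type. */
struct mb_lengths {
    size_t bytes;        /* 1.) what strlen reports */
    size_t code_points;  /* 2.) number of successful mbtowc calls */
    size_t glyphs;       /* 3.) code points whose wcwidth is > 0 */
    size_t columns;      /* 4.) sum of those positive widths */
};

/* Fills *out for the multibyte string s, decoded in the current locale.
   Returns 0 on success, -1 if s contains an invalid multibyte sequence. */
int mb_measure(const char *s, struct mb_lengths *out)
{
    size_t left = strlen(s);

    out->bytes = left;
    out->code_points = 0;
    out->glyphs = 0;
    out->columns = 0;

    mbtowc(NULL, NULL, 0);              /* reset mbtowc's internal shift state */
    while (left > 0) {
        wchar_t wc;
        int n = mbtowc(&wc, s, left);   /* bytes consumed, or -1 on error */
        if (n <= 0)
            return -1;
        s += n;
        left -= (size_t)n;
        out->code_points++;

        int w = wcwidth(wc);            /* 0 or -1 means: takes no cursor position */
        if (w > 0) {
            out->glyphs++;
            out->columns += (size_t)w;
        }
    }
    return 0;
}
```

To decode anything beyond ASCII, the program would first have to call setlocale(LC_ALL, "") (from <locale.h>) with a UTF-8 locale in the environment. Under such a locale the string from the question should give 10 bytes and 6 code points; the last two counts depend on the width data that the locale supplies to wcwidth.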

Having said all that, there is no guarantee that the value returned by wcwidth corresponds either to the actual character widths of the current console font or to the Unicode version which is being used by the application. (I've had trouble with both of these.) The values returned by wcwidth are extracted from the current locale, so you can edit and recompile your locale files to fix errors. See, for example, my answer here: How to get ncurses to output astral plane unicode characters
