Reputation: 5543
I am not sure if my assumptions are correct, but I feel that all four kinds of length of a multibyte sequence can be different, to illustrate:
Say the multibyte encoding is UTF-8, and we have the string "\xc3\xb8 \xe2\x86\x82 e\xcc\x88", the UTF-8 encoding of "\u00f8 \u2182 e\u0308", that is, "ø ↂ ë".
This string has a length of:
1.) 10 bytes
2.) 6 code points
3.) 5 characters
4.) 6 cursor positions
1.) is returned by strlen, and 2.) can be determined with the <wchar.h> functions.
But is there a portable way of determining 3.) and 4.)? I am not sure whether ↂ taking two cursor positions is defined font-independently for that code point or is a property of the font in use. I feel that "monospaced font" and "some characters take more than one space" are somewhat contradictory. At least in Monospace, this character does cover two cursor positions. The Unicode chart U2150 doesn't say anything about cursor positions.
Lastly, is the number of positions negative for any character (I mean, a character putting the cursor position to the left in a left-to-right script or vice versa)?
Upvotes: 2
Views: 342
Reputation: 241671
The POSIX interface wcwidth can be used to find the number of "cursor positions" of a wchar_t. In order to get the wchar_t values (one at a time), you can use the C99 standard library function mbtowc, which extracts a single multibyte character from a string and returns the number of bytes consumed. (Repeatedly calling mbtowc on a string and updating the string pointer each time will tell you how many multibyte characters are present in the string, at least if the multibyte coding is UTF-8.)
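A minimal sketch of that counting loop, assuming the locale's LC_CTYPE has already been set to a UTF-8 locale by the caller. The function name count_multibyte_chars is made up for this illustration and is not part of any library:

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: count the multibyte characters in s under the
 * current LC_CTYPE locale by calling mbtowc repeatedly.
 * Returns (size_t)-1 if an invalid multibyte sequence is found. */
size_t count_multibyte_chars(const char *s)
{
    size_t remaining = strlen(s);
    size_t count = 0;

    mbtowc(NULL, NULL, 0);                 /* reset the conversion state */
    while (remaining > 0) {
        wchar_t wc;
        int n = mbtowc(&wc, s, remaining); /* bytes consumed by one character */
        if (n <= 0)                        /* -1 means an invalid sequence */
            return (size_t)-1;
        s += n;                            /* advance the string pointer */
        remaining -= (size_t)n;
        count++;
    }
    return count;
}
```

A caller would normally run setlocale(LC_CTYPE, "") first so that mbtowc decodes according to the user's environment. In a UTF-8 locale, the string from the question gives 6.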
The combination of wcwidth and mbtowc can more or less tell you how many glyphs you have in the string (your question #3). A wchar_t for which wcwidth returns 0 is either a zero-width format control or a combining character, and a wchar_t for which wcwidth returns -1 is either a non-character or a control character (like \n). Either way, it can be ignored, so the glyph count is effectively the count of wchar_t whose width is >0.
That makes it clear that the four questions have different answers:
1.) number of bytes.
2.) number of multibyte codepoints.
3.) number of multibyte codepoints whose wcwidth is greater than 0.
4.) sum of the wcwidth of the multibyte codepoints whose wcwidth is greater than 0.
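The four counts can be computed in a single pass. The following is only a sketch under the assumptions already stated: a POSIX system that provides wcwidth, and an LC_CTYPE locale set by the caller. The names struct text_lengths and measure_text are invented for this example:

```c
#define _XOPEN_SOURCE 700   /* ask the headers to declare POSIX wcwidth() */
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical result type holding the four kinds of length. */
struct text_lengths {
    size_t bytes;       /* 1. what strlen returns */
    size_t codepoints;  /* 2. multibyte characters decoded by mbtowc */
    size_t glyphs;      /* 3. codepoints whose wcwidth is > 0 */
    size_t columns;     /* 4. sum of those positive widths */
};

/* Fill *out for the string s. Returns 0 on success, or -1 if s contains
 * an invalid multibyte sequence for the current locale. */
int measure_text(const char *s, struct text_lengths *out)
{
    size_t remaining = strlen(s);

    out->bytes = remaining;
    out->codepoints = out->glyphs = out->columns = 0;

    mbtowc(NULL, NULL, 0);                 /* reset the conversion state */
    while (remaining > 0) {
        wchar_t wc;
        int n = mbtowc(&wc, s, remaining);
        if (n <= 0)
            return -1;
        s += n;
        remaining -= (size_t)n;
        out->codepoints++;

        int w = wcwidth(wc);               /* 0: combining or format, -1: control */
        if (w > 0) {
            out->glyphs++;
            out->columns += (size_t)w;
        }
    }
    return 0;
}
```

For the string in the question, a UTF-8 locale gives 10 bytes and 6 codepoints. The glyph and column counts depend on the width tables of the locale in use, so they may or may not match what a given terminal font actually draws.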
Having said all that, there is no guarantee that the value returned by wcwidth corresponds either to the actual character widths of the current console font or to the Unicode version which is being used by the application. (I've had trouble with both of these.) The values returned by wcwidth are extracted from the current locale, so you can edit and recompile your locale files to fix errors. See, for example, my answer here: How to get ncurses to output astral plane unicode characters
Upvotes: 2
Reputation: 47954
But is there a portable way of determining [the characters and the cursor positions]?
Both of these are fuzzy concepts. For example, whether the Roman numeral ten thousand (ↂ) occupies two cursor positions, as it does in some fonts, may depend on how a particular application chooses to present it.
In general, people rely on the platform (e.g., the native text rendering engine) or a library like ICU to get things like cursor positions and shaped glyphs.
Upvotes: 2