Reputation: 351
I'm working on Unicode support in a Linux console application. I've run into the need to change the screen buffer format to store Unicode glyphs instead of bytes representing ASCII characters. Unicode has combining characters, so more than one code point can be rendered into a single console cell.
The question is: what is the maximum number of Unicode combining characters that may be needed to render one glyph in real-life languages? For example, are there any languages in the world whose glyphs need more than 8 combining characters to render? Let's assume I don't need "Zalgo text" support if it comes at the cost of the performance degradation of using variable-length storage for each console buffer glyph.
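To make the premise concrete, here's a small Python illustration (the variable name is mine) of the kind of multi-code-point cell I mean:

```python
import unicodedata

# One rendered glyph ("é") built from two code points:
# LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT.
cell = "e\u0301"
print(cell)                                 # é (one console cell)
print([unicodedata.name(c) for c in cell])  # two code point names
print(unicodedata.normalize("NFC", cell))   # composes to the single code point U+00E9
```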
Upvotes: 1
Views: 415
Reputation: 5635
Nobody can be an expert in what makes up a "real-life" character in every language, so I might be missing some longer sequences here. But I do know about a lot of emoji! There are a few emoji for flags of geographic subdivisions which are implemented as sequences of several code points. For example, the flag of Scotland, 🏴󠁧󠁢󠁳󠁣󠁴󠁿, is 7 codepoints, taking up 28 bytes in UTF-32.
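Spelled out explicitly (a minimal Python sketch; the variable name is mine), the sequence and its size look like this:

```python
flag_scotland = (
    "\U0001F3F4"   # WAVING BLACK FLAG
    "\U000E0067"   # TAG LATIN SMALL LETTER G
    "\U000E0062"   # TAG LATIN SMALL LETTER B
    "\U000E0073"   # TAG LATIN SMALL LETTER S
    "\U000E0063"   # TAG LATIN SMALL LETTER C
    "\U000E0074"   # TAG LATIN SMALL LETTER T
    "\U000E007F"   # CANCEL TAG
)
print(len(flag_scotland))                      # 7 code points
print(len(flag_scotland.encode("utf-32-be")))  # 28 bytes in UTF-32
```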
Country flags, like 🇯🇵, are just two codepoints (a pair of regional indicator symbols).
Family emojis with 4 people, like 👩‍👩‍👧‍👧, are also 7 codepoints. The only emoji I'm aware of that are longer are family emojis with a skin tone specified for each family member, but these don't have much support right now. Here's what one displays as on your device: 👩🏾‍👨🏾‍👧🏾‍👧🏾 (if you just see four separate heads, then you don't have a font installed that supports it). That emoji has 11 codepoints.
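If you want to double-check those counts, here's a minimal Python sketch (variable names are mine) that builds both sequences from their parts:

```python
ZWJ = "\u200D"                        # ZERO WIDTH JOINER
WOMAN, MAN, GIRL = "\U0001F469", "\U0001F468", "\U0001F467"
TONE = "\U0001F3FE"                   # EMOJI MODIFIER FITZPATRICK TYPE-5 (medium-dark)

family = ZWJ.join([WOMAN, WOMAN, GIRL, GIRL])
family_toned = ZWJ.join([p + TONE for p in (WOMAN, MAN, GIRL, GIRL)])

print(len(family))        # 7 code points
print(len(family_toned))  # 11 code points
```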
That being said, keep in mind that not all languages are rendered as a series of glyphs in sequence. Arabic, for example, is written cursively: أهلا is segmented by Unicode rules into 4 distinct characters, but the rendered letter shapes join together rather than lining up as separate per-character cells.
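Listing the characters makes that segmentation visible; a stdlib-only Python sketch:

```python
import unicodedata

ahlan = "\u0623\u0647\u0644\u0627"   # أهلا ("hello")
for ch in ahlan:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# Four distinct characters, but the rendered shapes join cursively,
# so they don't map one-to-one onto fixed console cells.
```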
Upvotes: 3