unxed

Reputation: 351

What is the maximum number of Unicode combined characters that may be needed to render one glyph in real-life languages?

I'm working on Unicode support in a Linux console application. I ran into a need to change the screen buffer format to store Unicode glyphs instead of bytes representing ASCII characters. Unicode has combined characters, hence more than one Unicode code point can be rendered into one console cell.

The question is: what is the maximum number of Unicode combined characters that may be needed to render one glyph in real-life languages? Are there any languages in the world that have glyphs that need more than 8 combined characters to render, for example? Let's assume that I don't need "Zalgo text" support at the cost of performance degradation caused by implementing dynamic length variables to store each console buffer glyph.

Upvotes: 1

Views: 415

Answers (1)

loops

Reputation: 5635

Nobody can be an expert in what makes up a "real-life" character in every language, so I might be missing some longer sequences here. But I do know about a lot of emoji! There are a few emojis for flags of geographic subdivisions which are implemented with combining codepoints. For example, the flag for Scotland, 🏴󠁧󠁒󠁳󠁣󠁴󠁿, is 7 codepoints, taking up 28 bytes in UTF-32:

  • WAVING BLACK FLAG
  • TAG LATIN SMALL LETTER G
  • TAG LATIN SMALL LETTER B
  • TAG LATIN SMALL LETTER S
  • TAG LATIN SMALL LETTER C
  • TAG LATIN SMALL LETTER T
  • CANCEL TAG
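
To check the count above yourself, here is a minimal Python sketch. The flag is spelled with escapes so the result does not depend on font or editor support; `describe` is just a helper name made up for this example.

```python
import unicodedata

# Flag of Scotland: a base emoji, five tag characters spelling "gbsct",
# and a terminating CANCEL TAG.
SCOTLAND_FLAG = (
    "\U0001F3F4"  # WAVING BLACK FLAG
    "\U000E0067"  # TAG LATIN SMALL LETTER G
    "\U000E0062"  # TAG LATIN SMALL LETTER B
    "\U000E0073"  # TAG LATIN SMALL LETTER S
    "\U000E0063"  # TAG LATIN SMALL LETTER C
    "\U000E0074"  # TAG LATIN SMALL LETTER T
    "\U000E007F"  # CANCEL TAG
)


def describe(text):
    """Return the Unicode name of every codepoint (hypothetical helper)."""
    return [unicodedata.name(ch, f"U+{ord(ch):04X}") for ch in text]


names = describe(SCOTLAND_FLAG)
for name in names:
    print(name)

# Python's len() counts codepoints, not bytes or rendered glyphs.
print(len(SCOTLAND_FLAG), "codepoints")
# "utf-32-le" is used so that no byte order mark is added to the count.
print(len(SCOTLAND_FLAG.encode("utf-32-le")), "bytes in UTF-32")
```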

Country flags, like 🇯🇵, have just two combining codepoints.
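
For illustration, the two codepoints of a country flag are regional indicator symbols, one per letter of the two-letter country code. A rough sketch of how such a pair is built (`country_flag` is a name invented for this example, and it does not check that the code belongs to a real country):

```python
# U+1F1E6 is REGIONAL INDICATOR SYMBOL LETTER A; B through Z follow it in order.
REGIONAL_INDICATOR_A = 0x1F1E6


def country_flag(code):
    """Build a flag sequence from a two-letter ASCII code (hypothetical helper)."""
    code = code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError("expected a two-letter ASCII country code")
    return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code)


japan = country_flag("JP")
print([f"U+{ord(ch):04X}" for ch in japan])
print(len(japan), "codepoints")
```

Whether the pair is drawn as one flag or as two boxed letters depends on the font and the terminal, so a console cell may still need room for both codepoints.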

Family emojis with 4 people, like 👩‍👩‍👧‍👧, are also 7 codepoints. The only emojis I'm aware of that are longer are family emojis with a skin tone specified for each family member, but these don't have a lot of support right now. Here's what one displays as on your device: 👩🏾‍👨🏾‍👧🏾‍👧🏾 (if you just see four heads, then you don't have a font installed that supports this). That emoji has 11 codepoints.
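
Here is a small sketch of where the 7 and the 11 come from, and of what a fixed-size cell would do with them. The capacity of 8 is only the example figure from the question, and `zwj_sequence` and `fits_in_cell` are names made up for this illustration, not part of any library.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER, glues the people into one emoji
WOMAN, MAN, GIRL = "\U0001F469", "\U0001F468", "\U0001F467"
SKIN_TONE = "\U0001F3FE"  # EMOJI MODIFIER FITZPATRICK TYPE-5

# Assumption: a fixed cell capacity of 8 codepoints, as floated in the question.
MAX_CELL_CODEPOINTS = 8


def zwj_sequence(people, modifier=""):
    """Join people with ZWJ, optionally following each with a modifier."""
    return ZWJ.join(person + modifier for person in people)


def fits_in_cell(cluster, capacity=MAX_CELL_CODEPOINTS):
    """True if the cluster fits in a fixed-size cell of `capacity` codepoints."""
    return len(cluster) <= capacity


family = zwj_sequence([WOMAN, WOMAN, GIRL, GIRL])
family_toned = zwj_sequence([WOMAN, MAN, GIRL, GIRL], SKIN_TONE)

# 4 people + 3 joiners, then 4 people + 4 modifiers + 3 joiners.
print(len(family), fits_in_cell(family))
print(len(family_toned), fits_in_cell(family_toned))
```

With a cap of 8, the plain family fits and the skin-toned one does not, so a fixed-size buffer would need some fallback for it, such as truncating the sequence or showing a replacement character.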

That being said, keep in mind that not all languages are rendered as a series of glyphs in sequence: أهلا is segmented using Unicode rules into 4 distinct characters.
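
As a rough check of that last point, this sketch lists the four codepoints of the Arabic word above using Python's standard `unicodedata` module. It only inspects per-codepoint properties; it is not a full implementation of Unicode text segmentation.

```python
import unicodedata

# The Arabic word from the sentence above, written with escapes.
GREETING = "\u0623\u0647\u0644\u0627"

# For each codepoint: general category, canonical combining class, and name.
letters = [
    (unicodedata.category(ch), unicodedata.combining(ch), unicodedata.name(ch))
    for ch in GREETING
]

for category, combining_class, name in letters:
    print(category, combining_class, name)

print(len(GREETING), "codepoints")
```

Every entry is an ordinary letter with combining class 0, so none of them attaches to its neighbour the way an accent does. The joined appearance comes from the font's shaping, which is a separate problem from how many codepoints a cell has to hold.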

Upvotes: 3
