Reputation: 256651
i'm need to write a function that will flip all the characters of a string left-to-right.
e.g.:
Thė quiçk ḇrown fox jumṕềᶁ ovểr thë lⱥzy ȡog.
should become
.goȡ yzⱥl ëht rểvo ᶁềṕmuj xof nworḇ kçiuq ėhT
i can limit the question to UTF-16 (which has the same problems as UTF-8, just less often).
A naive solution might try to flip all the things (e.g. word-for-word, where a word is 16-bits - i would have said byte for byte if we could assume that a byte was 16-bits. i could also say character-for-character where character is the data type Char
which represents a single code-point):
String original = "ɗỉf̴ḟếr̆ęnͥt";
String flipped = "";
foreach (Char c in s)
{
flipped = c+fipped;
}
Results in the incorrectly flipped text:
ɗỉf̴ḟếr̆ęnͥt
̨tͥnę̆rếḟ̴fỉɗ
This is because one "character" takes multiple "code points".
ɗỉf̴ḟếr̆ęnͥt
ɗ
ỉ
f
˜
ḟ
ế
r
˘
ę
n
i
t
˛
and flipping each "code point" gives:
˛
t
i
n
ę
˘
r
ế
ḟ
˜
f
ỉ
ɗ
Which not only is not a valid UTF-16 encoding, it's not the same characters.
The problem happens in UTF-16 encoding when there is:
Those same issues happen in UTF-8 encoding, with the additional case
i can limit myself to the simpler UTF-16 encoding (since that's the encoding that the language that i'm using has (e.g. C#, Delphi)
The problem, it seems to me, is discovering if a number of subsequent code points are combining characters, and need to come along with the base glyph.
It's also fun to watch an online text reverser site fail to take this into account.
Note:
- any solution should assume that don't have access to a UTF-32 encoding library (mainly becuase i don't have access to any UTF-32 encoding library)
- access to a UTF-32 encoding library would solve the UTF-8/UTF-16 lingual planes problem, but not the combining diacritics problem
Upvotes: 7
Views: 5181
Reputation: 2134
Text Mechanic's Text Generator seems to do this in JavaScript. I'm sure it would be possible to translate the JS into another language after obtaining the author's consent (if you can find a 'contact' link for that site).
Upvotes: -1
Reputation: 536359
The term you're looking for is “grapheme cluster”, as defined in Unicode TR29 Cluster Boundaries.
Group the UTF-16 code units into Unicode code points (=characters) using the surrogate algorithm (easy), then group the characters into grapheme clusters using the Grapheme_Cluster_Break rules. Finally reverse the group order.
You will need a copy of the Unicode character database in order to recognise grapheme cluster boundaries. That's already going to take up a considerable amount of space, so you're probably going to want to get a library to do it. For example in ICU you might use a CharacterIterator (which is misleadingly named as it works on grapheme clusters, not ‘characters’ as Unicode knows it).
Upvotes: 3
Reputation: 47952
If you work in UTF-32, you solve the non-base-plane issue. Converting from UTF-8 or UTF-16 to UTF-32 (and back) is relatively simple bit twiddling (see Wikipedia). You don't have to have a library for it.
Most of the combining characters are in a few ranges. You could determine those ranges by scanning the Unicode database (see Unicode.org). Hardcode those ranges into your application. With that, you can determine the groups of codepoints that represent a single character. (The drawback is that new combining marks could be introduced in the future, and you'd need to update your table.)
Segment appropriately, reverse the order (segment by segment), and convert back to UTF-8 or UTF-16 (or whatever you want).
Upvotes: 2