Ian Boyd

Reputation: 256651

How to flip text horizontally?

I need to write a function that will flip all the characters of a string left-to-right.

e.g.:

Thė quiçk ḇrown fox jumṕềᶁ ovểr thë lⱥzy ȡog.

should become

.goȡ yzⱥl ëht rểvo ᶁềṕmuj xof nworḇ kçiuq ėhT

I can limit the question to UTF-16 (which has the same problems as UTF-8, just less often).

Naive solution

A naive solution might try to flip all the things one unit at a time (e.g. word-for-word, where a word is 16 bits; I would have said byte-for-byte if we could assume that a byte was 16 bits. I could also say character-for-character, where a character is the data type Char, which holds a single 16-bit UTF-16 code unit):

String original = "ɗỉf̴ḟếr̆ęnͥt";
String flipped = "";
foreach (Char c in original)
{
   // Prepend each Char, so the string is reversed one 16-bit unit at a time
   flipped = c + flipped;
}

Results in the incorrectly flipped text:

This is because one "character" takes multiple "code points".

and flipping each "code point" gives:

Which not only is not a valid UTF-16 encoding, it's not the same characters.
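The two failure modes can be reproduced in a few lines. This is a minimal sketch in Python rather than C# or Delphi, and it models UTF-16 code units as plain integers; the helper names `utf16_units` and `naive_reverse_units` are made up for illustration.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints (hypothetical helper)."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

def naive_reverse_units(text):
    """Reverse code unit by code unit, as the loop above does."""
    return utf16_units(text)[::-1]

# U+1D11E (musical symbol G clef) lies outside the BMP, so UTF-16
# stores it as a surrogate pair: high surrogate, then low surrogate.
clef = "\U0001D11E"
units = utf16_units(clef)
flipped = naive_reverse_units(clef)  # low surrogate now comes first: not valid UTF-16

# "e" + U+0301 (combining acute accent) + "x", reversed code point by code point.
marked = "e\u0301x"
flipped_marked = "".join(reversed(marked))  # the accent now follows "x" instead of "e"
```

The first case breaks the encoding itself; the second stays valid but attaches the diacritic to the wrong base letter.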

Failure

The problem happens in UTF-16 encoding when there is:

Those same issues happen in UTF-8 encoding, with the additional case

I can limit myself to the simpler UTF-16 encoding, since that's the encoding used by the languages I'm working in (e.g. C#, Delphi).

The problem, it seems to me, is discovering if a number of subsequent code points are combining characters, and need to come along with the base glyph.

It's also fun to watch an online text reverser site fail to take this into account.

Note:

  • any solution should assume that I don't have access to a UTF-32 encoding library (mainly because I don't have access to any UTF-32 encoding library)
  • access to a UTF-32 encoding library would solve the UTF-8/UTF-16 lingual planes problem, but not the combining diacritics problem

Upvotes: 7

Views: 5181

Answers (3)

Agi Hammerthief

Reputation: 2134

Text Mechanic's Text Generator seems to do this in JavaScript. I'm sure it would be possible to translate the JS into another language after obtaining the author's consent (if you can find a 'contact' link for that site).

Upvotes: -1

bobince

Reputation: 536359

The term you're looking for is “grapheme cluster”, as defined in Unicode TR29 (Unicode Text Segmentation), under Grapheme Cluster Boundaries.

Group the UTF-16 code units into Unicode code points (=characters) using the surrogate algorithm (easy), then group the characters into grapheme clusters using the Grapheme_Cluster_Break rules. Finally reverse the group order.
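The first step, grouping code units into code points, might look roughly like this. It is a sketch in Python with the code units given as a list of integers; the function name `code_points_from_utf16` is an assumption for illustration, and passing unpaired surrogates through unchanged is just one possible policy.

```python
def code_points_from_utf16(units):
    """Group UTF-16 code units (ints) into Unicode code points."""
    points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # High surrogate followed by low surrogate: combine 10 bits from each
            low = units[i + 1]
            points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP code unit, or an unpaired surrogate passed through unchanged
            points.append(unit)
            i += 1
    return points
```

For example, `code_points_from_utf16([0x41, 0xD834, 0xDD1E, 0x42])` yields three code points, the middle one being U+1D11E.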

You will need a copy of the Unicode character database in order to recognise grapheme cluster boundaries. That's already going to take up a considerable amount of space, so you're probably going to want to get a library to do it. For example, in ICU you might use a character BreakIterator (which is misleadingly named, as it works on grapheme clusters, not ‘characters’ as Unicode knows them).

Upvotes: 3

Adrian McCarthy

Reputation: 47952

If you work in UTF-32, you solve the non-base-plane issue. Converting from UTF-8 or UTF-16 to UTF-32 (and back) is relatively simple bit twiddling (see Wikipedia). You don't have to have a library for it.
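As a rough illustration of the "and back" direction, here is a sketch in Python that turns a list of code points (which is what UTF-32 stores) into UTF-16 code units. The function name `utf16_from_code_points` is hypothetical, and the input is assumed to contain only valid code points.

```python
def utf16_from_code_points(points):
    """Encode Unicode code points (ints) as UTF-16 code units (ints)."""
    units = []
    for cp in points:
        if cp >= 0x10000:
            # Outside the base plane: split the 20-bit offset into a surrogate pair
            offset = cp - 0x10000
            units.append(0xD800 + (offset >> 10))    # high surrogate: top 10 bits
            units.append(0xDC00 + (offset & 0x3FF))  # low surrogate: bottom 10 bits
        else:
            # Base-plane code points map to a single code unit
            units.append(cp)
    return units
```

Decoding is the same arithmetic in reverse, so neither direction needs a library.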

Most of the combining characters are in a few ranges. You could determine those ranges by scanning the Unicode database (see Unicode.org). Hardcode those ranges into your application. With that, you can determine the groups of codepoints that represent a single character. (The drawback is that new combining marks could be introduced in the future, and you'd need to update your table.)
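One way to produce such a table is to scan a copy of the Unicode database once and emit the ranges. The sketch below uses Python's standard `unicodedata` module and treats every code point whose general category starts with "M" (Mn, Mc, Me) as a combining mark; that definition and the function name `combining_ranges` are assumptions, and the result depends on the Unicode version bundled with the interpreter.

```python
import sys
import unicodedata

def combining_ranges():
    """Collect inclusive (start, end) ranges of code points whose general
    category is a mark (Mn, Mc or Me) in this interpreter's Unicode data."""
    ranges = []
    start = None
    for cp in range(sys.maxunicode + 1):
        is_mark = unicodedata.category(chr(cp)).startswith("M")
        if is_mark and start is None:
            start = cp                       # a new run of marks begins
        elif not is_mark and start is not None:
            ranges.append((start, cp - 1))   # the run ended at the previous code point
            start = None
    if start is not None:
        ranges.append((start, sys.maxunicode))
    return ranges
```

The resulting list could then be pasted into a C# or Delphi source file as a constant table and searched with a binary search.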

Segment appropriately, reverse the order (segment by segment), and convert back to UTF-8 or UTF-16 (or whatever you want).
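Put together, the segment-and-reverse step could be sketched as follows. This is Python, where a string already iterates code point by code point, so the surrogate handling is not shown; the names `segments` and `flip` are made up for illustration. It is a simplification, not the full TR29 rule set: it only keeps combining marks (general category M*) with the preceding base character and ignores cases such as Hangul jamo, emoji joined by ZWJ, and CR LF pairs.

```python
import unicodedata

def segments(text):
    """Split text into groups: a base character plus any combining marks after it."""
    groups = []
    for ch in text:
        if groups and unicodedata.category(ch).startswith("M"):
            groups[-1] += ch    # a mark travels with the preceding base character
        else:
            groups.append(ch)   # anything else starts a new segment
    return groups

def flip(text):
    """Reverse the order of the segments, leaving each segment intact."""
    return "".join(reversed(segments(text)))
```

With this, `flip("e\u0301x")` keeps the acute accent on the "e" while moving the "x" to the front.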

Upvotes: 2
