Carlos D

Reputation: 180

Reliably rotating any string

I was experimenting with multibyte strings and how to handle them, using the code in this gist:

https://gist.github.com/charlydagos/89f67808e01f97e6de91

I was successful in rotating most strings. However, I noticed that the line

$chr = mb_substr($str, $i, 1);

does not work for flag emojis, since they consist of more than one Unicode code point.
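To make the problem concrete, here is a minimal sketch (not taken from the gist) showing how mbstring sees a flag. It assumes the mbstring extension is loaded and uses PHP 7's `\u{...}` escapes so the example does not depend on the file's encoding; the variable names are only illustrative.

```php
<?php
// Assumption: mbstring is loaded. The Swiss flag is written as its two
// regional indicator code points: U+1F1E8 (C) and U+1F1ED (H).
$flag = "\u{1F1E8}\u{1F1ED}";

// mb_strlen counts code points, so the single visible flag has length 2.
var_dump(mb_strlen($flag, 'UTF-8')); // int(2)

// Taking one "character" yields only the first regional indicator,
// not the whole flag.
$first = mb_substr($flag, 0, 1, 'UTF-8');
var_dump($first === "\u{1F1E8}"); // bool(true)
```

So any loop that steps through the string one `mb_substr` character at a time will separate the two halves of the flag.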

You can try the following in your own shells:

This gives the desired output: $ php string_rotate_mb.php "你好"

This, however: $ php string_rotate_mb.php "🇨🇭" returns [H][C]

This is technically correct, since it did rotate the string. But really it's a single glyph, and my desired output is the flag alone (or a sequence of flags, which currently becomes even more garbled, sometimes even turning into different flags).

How can I, then, reliably determine that I should grab a $length = 1 or a $length = 2 (or a $length = N) substring using mb_substr?

For reference, I'm using PHP 7.0.2 (cli) (built: Jan 7 2016 10:40:26) ( NTS ), ZSH_VERSION = 5.2, LC_ALL=en_us.utf-8, and iTerm2: Build 2.9.git.8dff8db518.

Update - Feb 5th 2016

Solution: https://gist.github.com/charlydagos/6755ad994da07a7b4959#file-string_rotate_working-php-L39-L56

Thank you roeland for introducing the concept of Grapheme Clusters. Good info also in the following links

Upvotes: 0

Views: 123

Answers (1)

roeland

Reputation: 5741

There are a lot more examples where this fails:

  • Combining characters: compare ê and ê (the first one is actually U+0065 followed by U+0302; the second is the single code point U+00EA)

  • Variants: e.g. emoji can have a black/white or color variant: 🎂︎ vs 🎂️. This is done by adding a variation selector after the emoji. There is a similar problem with skin tone variations: 🙌🏻 🙌🏼 🙌🏽 🙌🏾 🙌🏿. (Note: support for this is a bit spotty, but at least Windows 10 supports these variants.)

  • Flags, which consist of two code points.

  • Fractions using the fraction slash (U+2044) may be rendered as one glyph as well, e.g. 1⁄2. Note the difference from 1/2.

And so on…
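As a rough illustration of the cases listed above, this sketch counts the code points behind each example. It assumes mbstring is loaded; the labels and the `$samples` / `$counts` names are made up for the example.

```php
<?php
// Assumption: mbstring is loaded; strings use \u{...} escapes (PHP 7+).
$samples = [
    'e + combining circumflex' => "e\u{0302}",          // U+0065 U+0302
    'precomposed e-circumflex' => "\u{00EA}",           // U+00EA
    'cake + text selector'     => "\u{1F382}\u{FE0E}",  // U+1F382 U+FE0E
    'hands + skin tone'        => "\u{1F64C}\u{1F3FB}", // U+1F64C U+1F3FB
    'flag CH'                  => "\u{1F1E8}\u{1F1ED}", // two regional indicators
    '1 fraction-slash 2'       => "1\u{2044}2",         // U+0031 U+2044 U+0032
];

$counts = [];
foreach ($samples as $label => $text) {
    // mb_strlen reports code points, not what the reader sees as one glyph.
    $counts[$label] = mb_strlen($text, 'UTF-8');
    printf("%-26s %d code point(s)\n", $label, $counts[$label]);
}
```

Only the precomposed ê is a single code point. As far as I know the fraction case is different in kind from the others: it is a rendering feature of the font rather than a grapheme cluster, so cluster-aware code would probably still see three separate units there.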

I think what you're looking for is called grapheme clusters. Without library support I think this is pretty difficult to get right.

For recent PHP versions there is the intl extension. You may loop over the clusters using the grapheme functions.
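A minimal sketch of looping over clusters with the grapheme functions, assuming the intl extension is loaded. `split_graphemes` and `rotate_left` are hypothetical helpers written for this answer, not part of intl or of the linked gist, and "rotate" is assumed here to mean moving leading clusters to the end of the string.

```php
<?php
// Assumption: the intl extension is loaded (it provides grapheme_strlen
// and grapheme_substr).

/** Split a UTF-8 string into grapheme clusters (hypothetical helper). */
function split_graphemes(string $str): array
{
    $clusters = [];
    $length = grapheme_strlen($str);
    for ($i = 0; $i < $length; $i++) {
        // Offsets and lengths are counted in grapheme clusters here.
        $clusters[] = grapheme_substr($str, $i, 1);
    }
    return $clusters;
}

/** Rotate a string left by $steps clusters (hypothetical helper). */
function rotate_left(string $str, int $steps = 1): string
{
    $clusters = split_graphemes($str);
    $count = count($clusters);
    if ($count === 0) {
        return $str;
    }
    // Normalise negative or oversized step counts into 0..$count-1.
    $steps = (($steps % $count) + $count) % $count;
    return implode('', array_merge(
        array_slice($clusters, $steps),
        array_slice($clusters, 0, $steps)
    ));
}

// The combining circumflex stays attached to its base letter:
// the result is "ab" followed by e + U+0302.
echo rotate_left("e\u{0302}ab"), "\n";
```

Calling `grapheme_substr` in a loop is quadratic, which should be fine for short strings; `grapheme_extract` is the alternative for longer input. How flags and skin tone modifiers are grouped depends on the ICU version PHP was built against, so it is worth testing those cases on your own setup.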

Upvotes: 1
