Pacerier

Reputation: 89743

How to determine if a sequence of code points forms a natural character?

Good afternoon all,

I am building a function that takes a string as input, removes any unnatural combining diacritic characters from the string, and returns the modified string as output.

An unnatural combining diacritic sequence is a sequence of Unicode code points that, when combined, produces output that does not belong to any language under the sun (ancient scripts/languages are considered natural languages).

For example, given the String input:

   "aaà̴̵̶̷̸̡̢̧̨̛̖̗̘̙̜̝̞̟̠̣̤̥̦̩̪̫̬̭̮̯̯̰̱̲̳̹̺̻̼͇͈͉͍͎́̂̃̄̅̆̇̈̉̊̋̌̍̎̏̐̑̒̓̔̽̾̿̀́͂̓̈́͆͊͋͌̕̚͠͡ͅaa" //code points 0061 0061 0061 0300 0301 0302 0303 0304 0305 0306 0307 0308 0309 030a 030b 030c 030d 030e 030f 0310 0311 0312 0313 0314 0315 0316 0317 0318 0319 031a 031b 031c 031d 031e 031f 0320 0321 0322 0323 0324 0325 0326 0327 0328 0329 032a 032b 032c 032d 032e 032f 032f 0330 0331 0332 0333 0334 0335 0336 0337 0338 0339 033a 033b 033c 033d 033e 033f 0340 0341 0342 0343 0344 0345 0346 0347 0348 0349 034a 034b 034c 034d 034e 0360 0361 0061 0061

the function should return the result aaàaa (code points 0061 0061 0061 0300 0061 0061), since à́ (code points 0061 0300 0301) isn't a character in any natural language. In other words:

  assert F("aaà̴̵̶̷̸̡̢̧̨̛̖̗̘̙̜̝̞̟̠̣̤̥̦̩̪̫̬̭̮̯̯̰̱̲̳̹̺̻̼͇͈͉͍͎́̂̃̄̅̆̇̈̉̊̋̌̍̎̏̐̑̒̓̔̽̾̿̀́͂̓̈́͆͊͋͌̕̚͠͡ͅaa").equals("aaàaa");

Or, for source code saved using Latin charsets:

 assert F("\u0061\u0061\u0061\u0300\u0301\u0302\u0303\u0304\u0305\u0306\u0307\u0308\u0309\u030a\u030b\u030c\u030d\u030e\u030f\u0310\u0311\u0312\u0313\u0314\u0315\u0316\u0317\u0318\u0319\u031a\u031b\u031c\u031d\u031e\u031f\u0320\u0321\u0322\u0323\u0324\u0325\u0326\u0327\u0328\u0329\u032a\u032b\u032c\u032d\u032e\u032f\u032f\u0330\u0331\u0332\u0333\u0334\u0335\u0336\u0337\u0338\u0339\u033a\u033b\u033c\u033d\u033e\u033f\u0340\u0341\u0342\u0343\u0344\u0345\u0346\u0347\u0348\u0349\u034a\u034b\u034c\u034d\u034e\u0360\u0361\u0061\u0061").equals("\u0061\u0061\u0061\u0300\u0061\u0061");

How do we go about determining whether a sequence of characters or a sequence of Unicode code points is natural?

Or rather, is there a limit to how many combining diacritic characters a character belonging to a natural language will use?

Upvotes: 4

Views: 436

Answers (3)

bobince

Reputation: 536615

An unnatural combining diacritic sequence is a sequence of unicode code points that when combined, produces output that does not belong to any language under the sun

I'm afraid you won't be able to satisfy this requirement without knowledge of all languages under the sun.

The nearest you can do with just the standard Unicode data set is to normalise to NFKC and see if there are any decomposed combining-class characters left. That doesn't tell you anything about natural languages; it only relies on the heuristic that there will probably be a combined character defined for the combinations that are in common use. That holds true for the most common simple alphabets, which may be enough for you.
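For instance, a minimal sketch of that heuristic in Java (the class and method names here are just for illustration; java.text.Normalizer is the only dependency):

    import java.text.Normalizer;

    public class CombiningCheck {
        // Heuristic only: a mark left over after NFKC just means Unicode defines
        // no precomposed character for that combination, not that the combination
        // is "unnatural" in any language.
        static boolean hasLooseCombiningMarks(String s) {
            String nfkc = Normalizer.normalize(s, Normalizer.Form.NFKC);
            return nfkc.codePoints().anyMatch(cp -> {
                int type = Character.getType(cp);
                return type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK
                    || type == Character.ENCLOSING_MARK;
            });
        }

        public static void main(String[] args) {
            System.out.println(hasLooseCombiningMarks("\u0065\u0301"));       // false: composes to U+00E9
            System.out.println(hasLooseCombiningMarks("\u0061\u0300\u0301")); // true: U+0301 has no composed form here
        }
    }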

is there a limit to how many combining diacritic characters a character belonging to a natural language will use?

No. There is a practical limit stated in UAX #15: 'stream-safe' text must not use more than 30 consecutive combining characters, which would allow us to speculate that the Unicode standard will in general attempt to avoid character definitions that would lead to that many consecutive combining marks for a real-world language use case.
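If a crude cap is all you want, you could enforce something like that limit yourself. A rough sketch (my own simplification: it counts consecutive mark-category code points in the raw string, rather than non-starters after decomposition as UAX #15 actually specifies):

    public class StreamSafeCheck {
        static boolean exceedsStreamSafeLimit(String s) {
            int run = 0;
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);
                int type = Character.getType(cp);
                boolean isMark = type == Character.NON_SPACING_MARK
                              || type == Character.COMBINING_SPACING_MARK
                              || type == Character.ENCLOSING_MARK;
                run = isMark ? run + 1 : 0;
                if (run > 30) {
                    return true;   // longer run of combining marks than stream-safe text allows
                }
                i += Character.charCount(cp);
            }
            return false;
        }
    }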

The longest natural grapheme cluster I know of is:

ཧྐྵྨླྺྼྻྂ

(one initial character and eight nonspacing marks).

Upvotes: 1

McDowell

Reputation: 108959

Unicode 6.0:

All combining characters can be applied to any base character and can, in principle, be used with any script. As with other characters, the allocation of a combining character to one block or another identifies only its primary usage; it is not intended to define or limit the range of characters to which it may be applied. In the Unicode Standard, all sequences of character codes are permitted.

This does not create an obligation on implementations to support all possible combinations equally well. Thus, while application of an Arabic annotation mark to a Han character or a Devanagari consonant is permitted, it is unlikely to be supported well in rendering or to make much sense.

There is unlikely to be enough information in the Unicode data to do this algorithmically.

There are some rules for canonical composition/decomposition that you could use to determine if a sequence is a "natural" sequence. For example, canonical composition maps U+0065 U+0301 to U+00E9 (é). But this won't work for every case.
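In Java that mapping is a one-liner via java.text.Normalizer; for example (class and variable names are illustrative only):

    import java.text.Normalizer;

    public class CanonicalComposeDemo {
        public static void main(String[] args) {
            String decomposed = "\u0065\u0301";   // e + combining acute accent
            String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
            System.out.println(composed.equals("\u00e9"));   // true: composed to é

            // No precomposed form exists for a + grave + acute, so U+0301 is left over.
            String odd = "\u0061\u0300\u0301";
            System.out.println(Normalizer.normalize(odd, Normalizer.Form.NFC).equals("\u00e0\u0301"));   // true
        }
    }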

Beyond that, I'm not sure what you could do without using some form of validation table built by experts or generated from some corpus of language data.

Upvotes: 2

AlexR

Reputation: 115378

I think that you just need Character.isLetter(). I have just tried it with English, Russian and Hebrew characters and it returns true for all letters and false for all characters that are not letters.

I do not know whether characters like '.', ',' etc. are natural, but you can easily enumerate all these characters if you need them.
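A minimal sketch of that idea (note that combining marks such as U+0300 are category Mn, not letters, so Character.isLetter() drops them too, which may be more than the question wants):

    public class LetterFilter {
        static String keepLetters(String s) {
            StringBuilder sb = new StringBuilder();
            s.codePoints()
             .filter(Character::isLetter)   // keeps letters from any script
             .forEach(sb::appendCodePoint); // drops marks, punctuation, digits, etc.
            return sb.toString();
        }

        public static void main(String[] args) {
            System.out.println(keepLetters("a\u0300b."));   // prints "ab"
        }
    }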

Upvotes: 1
