Reputation: 29
I was trying to compare two Spark DataFrames that contain Japanese characters, and some characters look the same but are treated as different by the program, such as プ vs プ
If you run them through a UTF-8 encoder:
プ utf-8 = \xE3\x83\x97
プ utf-8 = \xE3\x83\x95\xE3\x82\x9A
It looks like フ (\xE3\x83\x95) + the little circle semi-voiced sign (\xE3\x82\x9A) = プ
What is this difference called, and is there any way to convert between the two forms in Java/Scala?
Thank you.
Upvotes: 2
Views: 125
Reputation: 159165
プ aka \xE3\x83\x97 (UTF-8) is \u30d7 aka 'KATAKANA LETTER PU' (U+30D7).
プ aka \xE3\x83\x95\xE3\x82\x9A (UTF-8) is \u30d5\u309a aka 'KATAKANA LETTER HU' (U+30D5) and 'COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK' (U+309A).
As you can see, the second is a base character followed by a combining character. This is similar to how diacritical marks (aka accent marks) are handled for Latin characters, e.g. how ñ = n + ̃ aka \u00f1 = \u006e + \u0303.
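To make the difference concrete, here is a small sketch that builds both forms from their code points and prints their lengths and UTF-8 bytes. The class name `KatakanaForms` and the helper `utf8Hex` are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class KatakanaForms {
    // Hypothetical helper: renders the UTF-8 bytes of a string as \xNN escapes.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("\\x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String precomposed = "\u30d7";       // single code point: KATAKANA LETTER PU
        String decomposed  = "\u30d5\u309a"; // KATAKANA LETTER HU + combining semi-voiced sound mark

        // The two strings render alike but are not equal as code point sequences.
        System.out.println(precomposed.equals(decomposed));                     // false
        System.out.println(precomposed.length() + " vs " + decomposed.length()); // 1 vs 2
        System.out.println(utf8Hex(precomposed)); // \xE3\x83\x97
        System.out.println(utf8Hex(decomposed));  // \xE3\x83\x95\xE3\x82\x9A
    }
}
```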
You can convert between the two forms (Unicode normalization forms: NFC is the composed form, NFD the decomposed form) using the java.text.Normalizer class. See: javadoc.
See also: The Java™ Tutorials - Normalizing Text.
See also: Combining accent and character into one character in java 7
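As a minimal sketch of the conversion, the following uses `java.text.Normalizer` with `Normalizer.Form.NFC` and `Normalizer.Form.NFD`. The class name `NormalizeKatakana` and the helpers `toComposed` / `toDecomposed` are illustrative names, not part of any library.

```java
import java.text.Normalizer;

class NormalizeKatakana {
    // Hypothetical helper: compose base + combining mark into one code point where possible (NFC).
    static String toComposed(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Hypothetical helper: split precomposed characters into base + combining mark (NFD).
    static String toDecomposed(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String precomposed = "\u30d7";       // KATAKANA LETTER PU
        String decomposed  = "\u30d5\u309a"; // KATAKANA LETTER HU + combining semi-voiced sound mark

        // After normalizing both sides to the same form, the strings compare equal.
        System.out.println(toComposed(decomposed).equals(precomposed));   // true
        System.out.println(toDecomposed(precomposed).equals(decomposed)); // true
    }
}
```

For the DataFrame comparison in the question, one plausible approach is to normalize both string columns to the same form (for example NFC) before comparing, for instance inside a UDF. That part is an assumption about the setup and is not shown here.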
Upvotes: 3