How to use Java substring on Japanese UTF-8 kanji

Question

Is it possible to use substring to extract a single kanji from a string? The problem is that one "character" can have a length greater than 1, because a Java String is stored as UTF-16 and some kanji take two char values (a surrogate pair).

For instance, the length of "𨦇𨦈𥻘" is 6, so String.substring(0, 1) doesn't return the first complete character.

In Perl, I could just use substr("𨦇𨦈𥻘", 0, 1) to get the first character, or substr("𨦇𨦈𥻘", 1, 1) to get the second character.
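
To make the problem concrete, here is a minimal sketch of what goes wrong. The class name SurrogateDemo and the helper isLoneSurrogate are only illustrative and not part of any library. It assumes the three kanji above lie outside the Basic Multilingual Plane, which is consistent with the reported length of 6.

```java
public class SurrogateDemo { // hypothetical class name, for illustration only

    // True if s is exactly one char and that char is half of a surrogate pair
    static boolean isLoneSurrogate(String s) {
        return s.length() == 1 && Character.isSurrogate(s.charAt(0));
    }

    public static void main(String[] args) {
        String kanji = "𨦇𨦈𥻘";

        // length() counts UTF-16 char units, not characters
        System.out.println(kanji.length());                         // 6

        // Cutting at index 1 splits the first character in half
        System.out.println(isLoneSurrogate(kanji.substring(0, 1))); // true

        // The first complete character spans two char units
        System.out.println(kanji.substring(0, 2));
    }
}
```

So the index arguments of substring have to land on code point boundaries, which is what the code point methods on String help with.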

UPDATE: Based on @msandiford's suggestion, I came up with this.

public class SplitKanji {
    // One element per Unicode code point of the input string
    private final String[] splitKanji;

    private SplitKanji(String string) {
        int cpCount = string.codePointCount(0, string.length());
        splitKanji = new String[cpCount];
        int nextSlot = 0;
        for (int i = 0; i < string.length();) {
            // Index of the char unit where the next code point starts
            int ii = string.offsetByCodePoints(i, 1);
            splitKanji[nextSlot++] = string.substring(i, ii);
            i = ii;
        }
    }

    private String[] get() {
        return splitKanji;
    }

    public static void main(String[] args) {
        String startKanji = "私今日𨦇𨦈𥻘";
        SplitKanji myStuff = new SplitKanji(startKanji);
        String[] split = myStuff.get();
        System.out.print(startKanji + "=");
        for (String kanji : split) {
            // length() is 1 for BMP characters and 2 for surrogate pairs
            System.out.print(kanji + ":" + kanji.length() + ", ");
        }
        System.out.println();
    }
}

clstrfsck · Accepted Answer

You can extract individual Unicode codepoints from the String like so:

  // Wrapper class added so the snippet compiles on its own; the name is arbitrary
  public class KanjiCodePoints
  {
    public static final String KANJI = "𨦇𨦈𥻘";

    public static void main(String[] args)
    {
      System.out.println(KANJI.length());                          // 6 (UTF-16 char units)
      System.out.println(KANJI.codePointCount(0, KANJI.length())); // 3 (code points)

      // Loop over each code point
      for (int i = 0; i < KANJI.length(); )
      {
        System.out.println(KANJI.codePointAt(i));
        i = KANJI.offsetByCodePoints(i, 1);
      }

      // Extract the third code point (two code points past index 0)
      int indexForThirdCodePoint = KANJI.offsetByCodePoints(0, 2);
      int thirdCodePoint = KANJI.codePointAt(indexForThirdCodePoint);
      System.out.println(thirdCodePoint);

      // Convert the code point back to a string
      System.out.println(new String(Character.toChars(thirdCodePoint)));
    }
  }

You could use the above techniques to obtain the start and end index of the codepoint that you require, and then use substring(start, end) to extract it.

(edit) All of this could be simplified with a bit of judicious refactoring and utility functions. Below is one possible example; I don't know what the use case for your code is, so it's a bit hard to know what would be best for you.

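// Wrapper class added so the snippet compiles on its own; the name is arbitrary
public class KanjiSubstring
{
  public static final String KANJI = "𨦇𨦈𥻘";

  // Length of s in Unicode code points rather than UTF-16 char units
  public static int lengthCodepoints(String s)
  {
    return s.codePointCount(0, s.length());
  }

  // Like substring, but start and length are counted in code points.
  // Throws IndexOutOfBoundsException if the range runs past the end of s.
  public static String substringCodepoint(String s, int startCodepoint, int numCodepoints)
  {
    int startIndex = s.offsetByCodePoints(0, startCodepoint);
    int endIndex = s.offsetByCodePoints(startIndex, numCodepoints);
    return s.substring(startIndex, endIndex);
  }

  public static void main(String[] args)
  {
    // Print each code point of KANJI on its own line
    int cpLength = lengthCodepoints(KANJI);
    for (int i = 0; i < cpLength; ++i)
    {
      System.out.println(substringCodepoint(KANJI, i, 1));
    }
  }
}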
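
If you are on Java 8 or later, a different technique from the one above is to stream the code points with String.codePoints() and build one string per code point. This is only a sketch: the class name CodePointSplit and the method splitCodePoints are made up for illustration, and it splits on code points only, so it does not group combining sequences into one unit.

```java
import java.util.Arrays;

public class CodePointSplit { // hypothetical class name, for illustration only

    // Splits s into one String per Unicode code point (requires Java 8+)
    static String[] splitCodePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String[] parts = splitCodePoints("私今日𨦇𨦈𥻘");
        System.out.println(parts.length);            // 6
        System.out.println(Arrays.toString(parts));
    }
}
```

Once the string is split this way, picking the n-th character is plain array indexing, at the cost of allocating a small string per code point.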
