israkir

Reputation: 2131

Splitting only the Chinese characters of a string in Java

I am writing a Java application, but I am stuck on this point.

Basically, I have a string of Chinese characters that may also contain some Latin characters or numbers, for example:

查詢促進民間參與公共建設法(210BOT法).

I want to split the Chinese characters into individual elements, but keep runs of Latin letters or digits (such as "BOT" above) together. So in the end I will have this kind of list:

[ 查, 詢, 促, 進, 民, 間, 參, 與, 公, 共, 建, 設, 法, (, 210, BOT, 法, ), ., ]

How can I solve this problem in Java?

Upvotes: 8

Views: 6010

Answers (3)

jgani

Reputation: 186

Disclaimer: I'm a complete Lucene newbie.

Using the latest version of Lucene (3.6.0 at the time of writing), I managed to get close to the result you require.

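  import java.io.StringReader;
  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.List;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
  import org.apache.lucene.util.Version;

  // An empty stop word set, so that no tokens are filtered out as stop words.
  Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36, Collections.emptySet());

  // "original" is the input String to be split.
  List<String> words = new ArrayList<String>();
  TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(original));
  CharTermAttribute termAttribute = tokenStream.addAttribute(CharTermAttribute.class);

  try {
    tokenStream.reset(); // Resets this stream to the beginning. (Required)
    while (tokenStream.incrementToken()) {
      words.add(termAttribute.toString());
    }
    tokenStream.end(); // Perform end-of-stream operations, e.g. set the final offset.
  }
  finally {
    tokenStream.close(); // Release resources associated with this stream.
  }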

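The result I get is as follows (note that StandardAnalyzer lowercases the tokens, drops the punctuation, and keeps "210BOT" together as a single token):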

[查, 詢, 促, 進, 民, 間, 參, 與, 公, 共, 建, 設, 法, 210bot, 法]

Upvotes: 2

Ben Simmons

Reputation: 1938

Here's an approach I would take.

You can use Character.codePointAt(char[] charArray, int index) to get the Unicode code point at a given index of your char array.

You will also need a mapping of Latin Unicode characters.

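If you look in the source of Character.UnicodeBlock, the Latin blocks (Basic Latin through Latin Extended-B) together cover the interval [0x0000, 0x024F]. So basically you check whether your Unicode code point lies within that interval. Note that this interval also contains the ASCII digits and punctuation, so those need a separate check.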

I suspect there is a way to just use a Character.Subset to check if it contains your char, but I haven't looked into that.
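A rough sketch of that check, using the block constants of Character.UnicodeBlock (which is itself a Character.Subset) rather than hard-coded bounds. The class name LatinCheck and the helper isLatin are made up for illustration, and the choice of which blocks count as "Latin" is an assumption you may want to adjust:

```java
import java.lang.Character.UnicodeBlock;

public class LatinCheck {
    // Hypothetical helper: true if the code point belongs to one of the
    // first four Latin blocks. Which blocks to include is an assumption.
    static boolean isLatin(int codePoint) {
        UnicodeBlock block = UnicodeBlock.of(codePoint);
        return block == UnicodeBlock.BASIC_LATIN
            || block == UnicodeBlock.LATIN_1_SUPPLEMENT
            || block == UnicodeBlock.LATIN_EXTENDED_A
            || block == UnicodeBlock.LATIN_EXTENDED_B;
    }

    public static void main(String[] args) {
        char[] chars = "公共建設法(210BOT法)".toCharArray();
        for (int i = 0; i < chars.length; ) {
            // Read a full code point, then advance by one or two chars.
            int cp = Character.codePointAt(chars, i);
            System.out.println(new String(Character.toChars(cp)) + " latin=" + isLatin(cp));
            i += Character.charCount(cp);
        }
    }
}
```

Keep in mind that ASCII digits and punctuation live in BASIC_LATIN too, so this reports them as Latin; something like Character.isDigit() or Character.isLetter() is still needed to tell them apart.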

Upvotes: 1

BalusC

Reputation: 1108972

Chinese characters lie within certain Unicode ranges:

  • 2F00-2FDF: Kangxi Radicals
  • 4E00-9FFF: CJK Unified Ideographs
  • 3400-4DBF: CJK Unified Ideographs Extension A

So all you basically need to do is check whether the character's code point lies within the known ranges. The example below is a good starting point for writing a stack-based parser/splitter; you only need to extend it to separate digits from Latin letters, which should be straightforward (hint: Character#isDigit()):

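import java.lang.Character.UnicodeBlock;
import java.util.HashSet;
import java.util.Set;

// Unicode blocks that are treated as "Chinese" for this check.
Set<UnicodeBlock> chineseUnicodeBlocks = new HashSet<UnicodeBlock>() {{
    add(UnicodeBlock.CJK_COMPATIBILITY);
    add(UnicodeBlock.CJK_COMPATIBILITY_FORMS);
    add(UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS);
    add(UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS_SUPPLEMENT);
    add(UnicodeBlock.CJK_RADICALS_SUPPLEMENT);
    add(UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION);
    add(UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS);
    add(UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A);
    add(UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B);
    add(UnicodeBlock.KANGXI_RADICALS);
    add(UnicodeBlock.IDEOGRAPHIC_DESCRIPTION_CHARACTERS);
}};

String mixedChinese = "查詢促進民間參與公共建設法(210BOT法)";

// Iterate by code point rather than by char, so that supplementary-plane
// characters (e.g. CJK Extension B) are not split into surrogate halves.
for (int i = 0; i < mixedChinese.length(); ) {
    int cp = mixedChinese.codePointAt(i);
    String s = new String(Character.toChars(cp));
    if (chineseUnicodeBlocks.contains(UnicodeBlock.of(cp))) {
        System.out.println(s + " is chinese");
    } else {
        System.out.println(s + " is not chinese");
    }
    i += Character.charCount(cp);
}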
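For what it's worth, the extension hinted at above could look roughly like the sketch below. It is only one possible way to do it, not a definitive implementation: the class name MixedSplitter, the Kind enum and the split method are made up for illustration, it assumes Java 7+ (it uses Character.UnicodeScript.HAN as a shortcut instead of the block set), and it assumes that whitespace should be dropped and that every character which is neither a digit nor a non-Chinese letter becomes its own token.

```java
import java.util.ArrayList;
import java.util.List;

public class MixedSplitter {
    // Character classes used to decide where one token ends and the next begins.
    private enum Kind { DIGIT, LETTER, OTHER }

    private static Kind kindOf(int cp) {
        if (Character.isDigit(cp)) return Kind.DIGIT;
        // Only non-Han letters (e.g. Latin) are grouped into runs.
        if (Character.isLetter(cp)
                && Character.UnicodeScript.of(cp) != Character.UnicodeScript.HAN) {
            return Kind.LETTER;
        }
        return Kind.OTHER; // Chinese characters, punctuation, whitespace, ...
    }

    // Hypothetical helper: digits and letters are grouped into runs,
    // everything else is emitted one code point at a time.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        Kind runKind = Kind.OTHER;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            Kind kind = kindOf(cp);
            // Flush the pending run as soon as the character class changes.
            if (run.length() > 0 && kind != runKind) {
                tokens.add(run.toString());
                run.setLength(0);
            }
            if (Character.isWhitespace(cp)) continue; // assumption: drop whitespace
            if (kind == Kind.OTHER) {
                tokens.add(new String(Character.toChars(cp)));
            } else {
                run.appendCodePoint(cp);
                runKind = kind;
            }
        }
        if (run.length() > 0) tokens.add(run.toString());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("查詢促進民間參與公共建設法(210BOT法)."));
        // prints [查, 詢, 促, 進, 民, 間, 參, 與, 公, 共, 建, 設, 法, (, 210, BOT, 法, ), .]
    }
}
```

If you would rather stay on the block-based check, the Han script test in kindOf can be swapped for a lookup in a set of UnicodeBlock values.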

Good luck.

Upvotes: 11
