antnewbee

Reputation: 1949

Splitting Japanese text into words in Java using BreakIterator

We are trying to split Japanese sentences into words with BreakIterator, following the code in this question. The code works only for the text given in that question. When we try a different text, e.g. "速い茶色のキツネは怠惰な犬を飛び越えます", it fails to split the text into words.

What could be the issue?

Upvotes: 0

Views: 793

Answers (1)

SATO Yusuke

Reputation: 2184

BreakIterator.getSentenceInstance(Locale.JAPAN), as used in that question, splits Japanese text into sentences rather than words. Japanese is usually written without spaces or other delimiters between words, so word boundaries cannot be found from punctuation alone.
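To see this behaviour in isolation, here is a minimal sketch that uses only java.text.BreakIterator from the JDK. The class name SentenceSplitDemo and the split() helper are made up for illustration and are not taken from the linked question.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; the name and the split() helper are illustrative.
class SentenceSplitDemo {
    // Collects the text between each pair of consecutive boundaries.
    static List<String> split(BreakIterator it, String text) {
        it.setText(text);
        List<String> parts = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            parts.add(text.substring(start, end));
        }
        return parts;
    }

    public static void main(String[] args) {
        String text = "速い茶色のキツネは怠惰な犬を飛び越えます";
        // Sentence instance: the text contains no sentence-ending punctuation,
        // so the whole string comes back as a single segment.
        List<String> sentences = split(BreakIterator.getSentenceInstance(Locale.JAPAN), text);
        System.out.println(sentences.size() + " segment(s): " + String.join(" | ", sentences));
    }
}
```

BreakIterator.getWordInstance(Locale.JAPAN) can be passed to the same helper, but it is not a morphological analyzer, so its boundaries should not be expected to match linguistic words.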

You have to use a morphological analyzer to break a sentence into words. For example, you can use a Java port of TinySegmenter.

import java.util.List;
// Third-party Java port of TinySegmenter; it must be on the classpath.
import jp.toastkid.libs.tinysegmenter.TinySegmenter;

public class Test {
    public static void main(String[] args) {
        TinySegmenter ts = TinySegmenter.getInstance();
        // segment() returns the input split into word-like tokens.
        List<String> list = ts.segment("速い茶色のキツネは怠惰な犬を飛び越えます。");
        System.out.println(String.join(" | ", list));
        // You will get "速い | 茶色 | の | キツネ | は | 怠惰 | な | 犬 | を | 飛び越え | ます"
    }
}

Upvotes: 1
