Tushar Sircar

Reputation: 11

Chinese tokenizer Stanford CoreNLP

Can somebody help me use Stanford CoreNLP to tokenize Chinese text in Java? This is my code so far:

    File file = new File("example.txt");
    file.createNewFile();

    // Write the test sentence; FileWriter uses the platform default encoding,
    // so write through an OutputStreamWriter with UTF-8 set explicitly
    Writer fileWriter = new OutputStreamWriter(new FileOutputStream(file), "UTF-8");
    fileWriter.write("这是很好");
    fileWriter.flush();
    fileWriter.close();

    // Read the file back with the same encoding
    InputStreamReader isReader = new InputStreamReader(new FileInputStream(file), "UTF-8");

    CHTBTokenizer chineseTokenizer = new CHTBTokenizer(isReader);

    String nextToken;
    while ((nextToken = chineseTokenizer.getNext()) != null)
        System.out.println(nextToken);

But instead of getting separate tokens, I'm getting the whole sentence back as a single token. Can somebody help me out?

Upvotes: 1

Views: 904

Answers (1)

Sebastian Schuster

Reputation: 1563

`CHTBTokenizer` is meant for tokenizing Chinese Treebank files in Penn Treebank (PTB) bracketed format, not for segmenting raw text, which is why your sentence comes back as a single token.

For plain Chinese text you need a word segmenter, which is also available from Stanford as a separate download. You can find more information and a download link on the Stanford Word Segmenter page.
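As a rough sketch of what using the segmenter looks like (this follows the `SegDemo` shipped with the segmenter download; the property names and the `data/...` model and dictionary paths are taken from that distribution and need to point at wherever you unpacked it):

```java
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import java.util.List;
import java.util.Properties;

public class SegmenterDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Paths below are the defaults from the segmenter distribution;
        // adjust them to your local unpacked copy
        props.setProperty("sighanCorporaDict", "data");
        props.setProperty("serDictionary", "data/dict-chris6.ser.gz");
        props.setProperty("inputEncoding", "UTF-8");
        props.setProperty("sighanPostProcessing", "true");

        // Load the Chinese Treebank segmentation model
        CRFClassifier<CoreLabel> segmenter = new CRFClassifier<>(props);
        segmenter.loadClassifierNoExceptions("data/ctb.gz", props);

        // segmentString returns the segmented words as a list
        List<String> words = segmenter.segmentString("这是很好");
        for (String word : words) {
            System.out.println(word);
        }
    }
}
```

Run it with the segmenter jar on the classpath; each segmented word is printed on its own line, which is the behavior you were expecting from `CHTBTokenizer`.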

Upvotes: 1

Related Questions