Reputation: 11
Can somebody help me use Stanford CoreNLP to tokenize Chinese text in Java? This is my code so far:
File file = new File("example.txt");
file.createNewFile();
// Write with an explicit encoding so the Chinese characters are not mangled
Writer fileWriter = new OutputStreamWriter(new FileOutputStream(file), "UTF-8");
fileWriter.write("这是很好");
fileWriter.flush();
fileWriter.close();

InputStreamReader isReader = new InputStreamReader(new FileInputStream(file), "UTF-8");
CHTBTokenizer chineseTokenizer = new CHTBTokenizer(isReader);
String nextToken;
while ((nextToken = chineseTokenizer.getNext()) != null)
    System.out.println(nextToken);
But instead of getting three separate tokens, I'm getting the whole sentence as a single token. Can somebody help me out?
Upvotes: 1
Views: 904
Reputation: 1563
The CHTBTokenizer is used to tokenize constituency trees in PTB format, not raw text.
For plain Chinese text you have to use a word segmenter, which is also available from Stanford. You can find more information and a download link on the Stanford Word Segmenter page.
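A minimal sketch of how the segmenter is typically driven, based on the SegDemo example shipped with the Stanford Word Segmenter. The base directory and model file names (`ctb.gz`, `dict-chris6.ser.gz`) are assumptions matching a default download; adjust the paths to wherever you unpacked the segmenter data.

```java
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class SegmentDemo {
    public static void main(String[] args) {
        // Directory where the segmenter's data/ folder was unpacked (assumed path)
        String basedir = "stanford-segmenter/data";

        Properties props = new Properties();
        props.setProperty("sighanCorporaDict", basedir);
        props.setProperty("serDictionary", basedir + "/dict-chris6.ser.gz");
        props.setProperty("inputEncoding", "UTF-8");
        props.setProperty("sighanPostProcessing", "true");

        CRFClassifier<CoreLabel> segmenter = new CRFClassifier<>(props);
        // ctb.gz is the Penn Chinese Treebank model shipped with the segmenter
        segmenter.loadClassifierNoExceptions(basedir + "/ctb.gz", props);

        // Segments the raw string into words, e.g. 这 / 是 / 很 / 好
        List<String> words = segmenter.segmentString("这是很好");
        for (String word : words) {
            System.out.println(word);
        }
    }
}
```

Note that the CRF model files have to be downloaded separately with the segmenter distribution; they are not part of the core CoreNLP jar.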
Upvotes: 1