Reputation: 49
I want to create a sentiment analysis program that takes in a dataset in Chinese and determines whether there are more positive, negative or neutral statements. Following the example, I created a sentiment analysis program for English (stanford-corenlp) that does exactly what I want, except that I need it to take in Chinese.
Questions:
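import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.util.CoreMap;

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
// annotator names: gender, lemma, ner, parse, pos, sentiment, ssplit, tokenize
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// read some text in the text variable
String sentimentText = "Fun day, isn't it?";
// index = predicted class (0..4); printed score is shifted to -2..+2
String[] ratings = {"Very Negative", "Negative", "Neutral", "Positive", "Very Positive"};
Annotation annotation = pipeline.process(sentimentText);
for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
    Tree tree = sentence.get(SentimentCoreAnnotations.AnnotatedTree.class);
    int score = RNNCoreAnnotations.getPredictedClass(tree);
    System.out.println("sentence:'" + sentence + "' has a score of " + (score - 2) + " rating: " + ratings[score]);
    System.out.println(tree);
}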
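The snippet above scores one sentence at a time, while the goal is to know which polarity dominates a whole dataset. A minimal sketch of that tallying step, independent of CoreNLP, might look like the following. The class name `SentimentTally`, the `Polarity` enum, and the choice to fold "Very Negative"/"Negative" and "Positive"/"Very Positive" into single buckets are my own assumptions for illustration, not part of the Stanford API.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper (not part of CoreNLP): counts predicted classes per polarity.
final class SentimentTally {
    enum Polarity { NEGATIVE, NEUTRAL, POSITIVE }

    // Collapse the five classes (0..4) into three buckets.
    // Assumption: 0 and 1 count as negative, 2 as neutral, 3 and 4 as positive.
    static Polarity toPolarity(int predictedClass) {
        if (predictedClass < 0 || predictedClass > 4) {
            throw new IllegalArgumentException("class out of range: " + predictedClass);
        }
        if (predictedClass < 2) {
            return Polarity.NEGATIVE;
        }
        return predictedClass == 2 ? Polarity.NEUTRAL : Polarity.POSITIVE;
    }

    // Count how many sentences fall into each bucket.
    static Map<Polarity, Integer> tally(int[] predictedClasses) {
        Map<Polarity, Integer> counts = new EnumMap<>(Polarity.class);
        for (Polarity p : Polarity.values()) {
            counts.put(p, 0);
        }
        for (int c : predictedClasses) {
            counts.merge(toPolarity(c), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Stand-in values for what RNNCoreAnnotations.getPredictedClass(tree) would return.
        int[] scores = {3, 1, 2, 4, 0, 3};
        System.out.println(tally(scores));
    }
}
```

In the loop above, each `score` would be collected into an array or list and passed to `tally` once the whole dataset has been processed.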
Currently, I have no idea how to change the above code to support the Chinese language. I downloaded the Chinese parser and segmenter and looked at the demo, but after days of trying, it didn't lead anywhere. I have also read http://nlp.stanford.edu/software/corenlp.shtml, which is really useful for the English version. Are there any ebooks, tutorials or examples that can help me understand how Chinese sentiment analysis works in Stanford NLP?
Thanks in advance!
PS: I picked up Java not too long ago, so pardon me if there is anything I did not say or do correctly.
What I have researched:
How to parse languages other than English with Stanford Parser? in java, not command lines
Using stanford parser to parse Chinese
Upvotes: 0
Views: 3089
Reputation: 11
I'm working on the same problem and having issues too. This is how far I have got:
You need to change the properties to support the Chinese language as follows:
props.setProperty("customAnnotatorClass.segment","edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator");
props.setProperty("pos.model","edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger");
props.setProperty("parse.model","edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz");
props.setProperty("segment.model","edu/stanford/nlp/models/segmenter/chinese/ctb.gz");
props.setProperty("segment.serDictionary","edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz");
props.setProperty("segment.sighanCorporaDict","edu/stanford/nlp/models/segmenter/chinese");
props.setProperty("segment.sighanPostProcessing","true");
props.setProperty("ssplit.boundaryTokenRegex","[.]|[!?]+|[。]|[！？]+");
props.setProperty("ner.model","edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz");
props.setProperty("ner.applyNumericClassifiers","false");
props.setProperty("ner.useSUTime","false");
But the problem that still persists is that the tokenizer being used still defaults to PTBTokenizer (for English).
For Spanish the corresponding properties are:
props.setProperty("tokenize.language","es");
props.setProperty("sentiment.model","src/international/spanish");
props.setProperty("pos.model","src/models/pos-tagger/spanish/spanish-distsim.tagger");
props.setProperty("ner.model","src/models/ner/spanish.ancora.distsim.s512.crf.ser.gz");
props.setProperty("ner.applyNumericClassifiers","false");
props.setProperty("ner.useSUTime","false");
props.setProperty("parse.model","src/models/lexparser/spanishPCFG.ser.gz");
This works just fine for Spanish. Notice that the 'tokenize.language' property is set to 'es'. There is no such property for Chinese: I have tried setting it to 'ch', 'cn', 'zh' and 'zh-cn', but nothing works. Let me know if you make further progress.
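One thing that may be worth checking, which I have not verified against a specific CoreNLP release: a custom annotator registered as `customAnnotatorClass.segment` is only run if its name, `segment`, appears in the `annotators` list. If `tokenize` is listed there instead, the default English tokenizer would be expected to run. The sketch below only builds the `Properties` object; the class name `ChinesePipelineProps` and the exact annotator order are my assumptions, and `sentiment` is left out because no Chinese sentiment model path is known here.

```java
import java.util.Properties;

// Hypothetical helper: assembles properties so that the segmenter replaces "tokenize".
final class ChinesePipelineProps {
    static Properties build() {
        Properties props = new Properties();
        // Assumption: list the custom "segment" annotator first, in place of "tokenize".
        props.setProperty("annotators", "segment, ssplit, pos, parse");
        // Registers the class that backs the "segment" annotator name.
        props.setProperty("customAnnotatorClass.segment",
                "edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator");
        // Model paths copied from the properties listed earlier in this answer.
        props.setProperty("segment.model",
                "edu/stanford/nlp/models/segmenter/chinese/ctb.gz");
        props.setProperty("segment.serDictionary",
                "edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz");
        props.setProperty("segment.sighanCorporaDict",
                "edu/stanford/nlp/models/segmenter/chinese");
        props.setProperty("segment.sighanPostProcessing", "true");
        props.setProperty("pos.model",
                "edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger");
        props.setProperty("parse.model",
                "edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz");
        return props;
    }

    public static void main(String[] args) {
        // With the CoreNLP and Chinese model jars on the classpath, the result
        // would be passed to new StanfordCoreNLP(props).
        System.out.println(build().getProperty("annotators"));
    }
}
```

Whether these model paths resolve depends on which Chinese models jar is on the classpath, so treat the values as placeholders to compare against the properties file bundled with your models.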
Upvotes: 0
Reputation: 111
Based on my experience with the German language, here is what you need to do:
BuildBinarizedDataset. Note that BuildBinarizedDataset is set up for the English language and will parse your sentences again. I found it more practical to apply the labels to my pre-existing parses from step 3.
For the annotation: either do this yourself or use a crowdsourcing service like CrowdFlower. I found the 'sentiment analysis' template on CrowdFlower to be useful.
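To illustrate what "applying the labels to pre-existing parses" can mean: the sentiment training files are bracketed trees whose node labels are class numbers (0 to 4) instead of syntactic categories. A rough sketch of the relabeling step is below. It is not what BuildBinarizedDataset does internally. The class `TreeRelabeler`, the single sentence-level label on the root, and the neutral placeholder on every other node are assumptions for illustration, and the sketch does not binarize the tree, which the trainer would still need.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: swaps syntactic node labels for sentiment class numbers.
final class TreeRelabeler {
    // Matches an opening bracket followed by a node label, e.g. "(NP" or "(S".
    private static final Pattern NODE_LABEL = Pattern.compile("\\(([^\\s()]+)");

    // The root gets rootLabel; every other node gets defaultLabel (an assumption,
    // used where no phrase-level annotation is available).
    static String relabel(String parse, int rootLabel, int defaultLabel) {
        Matcher m = NODE_LABEL.matcher(parse);
        StringBuilder out = new StringBuilder();
        int last = 0;
        boolean first = true;
        while (m.find()) {
            out.append(parse, last, m.start());
            out.append('(').append(first ? rootLabel : defaultLabel);
            first = false;
            last = m.end();
        }
        out.append(parse.substring(last));
        return out.toString();
    }

    public static void main(String[] args) {
        String parse = "(S (NP (PRP It)) (VP (VBZ is) (ADJP (JJ fun))))";
        // prints (3 (2 (2 It)) (2 (2 is) (2 (2 fun))))
        System.out.println(relabel(parse, 3, 2));
    }
}
```

With phrase-level annotations from the crowdsourcing step, the placeholder would be replaced by a lookup of each subtree's annotated class.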
Upvotes: 1