Reputation: 1
Hi guys,
I can use the POS tagger on segmented Chinese text by calling the method
MaxentTagger.tokenizeText(Reader r)
but now I want to tag the original Chinese text, which has not been segmented. I know the method
MaxentTagger.tokenizeText(Reader r,TokenizerFactory tokenizerFactory)
can do this, but TokenizerFactory is an interface with several implementing classes. How should I call this method? Can someone give some suggestions or examples? Thanks.
Upvotes: 0
Views: 249
Reputation: 9450
At present, the probabilistic sequence model word segmenters don't implement the TokenizerFactory interface, only the rule-based tokenizers ... though maybe this should be changed.
What you need to do is run the Stanford Word Segmenter to break the original Chinese text into words. This means calling the CRFClassifier class with the appropriate model and properties. The output will be a List<CoreLabel> (for a sentence) or a List<List<CoreLabel>> (for a document). These can be fed into the tagger with the List<TaggedWord> apply(List<? extends HasWord> in) or void tagCoreLabels(List<CoreLabel> sentence) methods.
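As a rough sketch of that segment-then-tag route, something like the following should work. It is modelled on the demo code shipped with the segmenter, not an official recipe. The paths (data, models/chinese-distsim.tagger) are assumptions and depend on where you unpacked the segmenter and tagger downloads. For brevity it uses segmentString, which returns plain strings rather than CoreLabels, wraps each one in a Word, and calls tagSentence rather than the apply method named above.

```java
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.ling.Word;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class SegmentThenTag {
    public static void main(String[] args) {
        // Assumed locations: change these to match your own downloads.
        String segDir = "data";                               // segmenter data directory
        String taggerModel = "models/chinese-distsim.tagger"; // Chinese tagger model

        // Properties the segmenter's CRF model expects.
        Properties props = new Properties();
        props.setProperty("sighanCorporaDict", segDir);
        props.setProperty("serDictionary", segDir + "/dict-chris6.ser.gz");
        props.setProperty("inputEncoding", "UTF-8");
        props.setProperty("sighanPostProcessing", "true");

        // Load the Chinese Treebank segmentation model.
        CRFClassifier<CoreLabel> segmenter = new CRFClassifier<>(props);
        segmenter.loadClassifierNoExceptions(segDir + "/ctb.gz", props);

        MaxentTagger tagger = new MaxentTagger(taggerModel);

        // Step 1: split the unsegmented text into words.
        String raw = "我住在美国。";
        List<String> words = segmenter.segmentString(raw);

        // Step 2: wrap each word so the tagger can accept it.
        List<HasWord> sentence = new ArrayList<>();
        for (String w : words) {
            sentence.add(new Word(w));
        }

        // Step 3: tag the segmented sentence and print word/tag pairs.
        List<TaggedWord> tagged = tagger.tagSentence(sentence);
        for (TaggedWord tw : tagged) {
            System.out.println(tw.word() + "/" + tw.tag());
        }
    }
}
```

This treats the whole input as one sentence. For a longer document you would need to split it into sentences first and tag each one separately.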
Or you might find it easier to use StanfordCoreNLP to connect the pieces together. It works with Chinese if you download the Chinese models jar available at http://nlp.stanford.edu/software/corenlp.shtml#History
Upvotes: 1