Reputation: 1

How to use the Stanford POS tagger on raw, unsegmented Chinese text

Hi guys,
I can use the POS tagger to tag segmented Chinese text by calling the method MaxentTagger.tokenizeText(Reader r),
but now I want to tag raw Chinese text that has not been segmented. I know the method MaxentTagger.tokenizeText(Reader r, TokenizerFactory tokenizerFactory)
can do this, but TokenizerFactory is an interface with several implementing classes, and I don't know which one to pass or how to call the method. Can someone give some suggestions or examples? Thanks.

Upvotes: 0

Answers (1)

Christopher Manning

Reputation: 9450

At present, only the rule-based tokenizers implement the TokenizerFactory interface; the probabilistic sequence model word segmenters don't ... though maybe this should be changed.

What you need to do is to run the Stanford Word Segmenter to break the original Chinese text into words. This means calling the CRFClassifier class with the appropriate model and properties. The output of this will be a List<CoreLabel> (for a sentence) or a List<List<CoreLabel>> (for a document). These can be fed into the tagger with the List<TaggedWord> apply(List<? extends HasWord> in) or void tagCoreLabels(List<CoreLabel> sentence) methods.
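The shape of that two-stage flow (segment first, then tag the resulting words) can be sketched roughly as follows. This is only a minimal illustration that runs without the Stanford jars: the class SegmentThenTag, the holder type Tagged, and the method tagRawText are hypothetical names, and both stages are stubs. In a real setup the segmenter stage would presumably wrap the word segmenter's CRFClassifier and the tagger stage would wrap one of the MaxentTagger methods mentioned above; that wiring is an assumption and is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class SegmentThenTag {

    /** Hypothetical holder for one word and its POS tag. */
    public static final class Tagged {
        public final String word;
        public final String tag;

        public Tagged(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }

        @Override
        public String toString() {
            return word + "/" + tag;
        }
    }

    /**
     * Runs the two stages in order: raw text -> words -> one tag per word.
     * The stages are passed in as functions so that real models (assumed:
     * the Stanford segmenter and tagger) or stubs can be plugged in.
     */
    public static List<Tagged> tagRawText(
            String rawText,
            Function<String, List<String>> segmenter,
            Function<List<String>, List<String>> tagger) {
        List<String> words = segmenter.apply(rawText);
        List<String> tags = tagger.apply(words);
        if (tags.size() != words.size()) {
            throw new IllegalStateException("tagger must return one tag per word");
        }
        List<Tagged> result = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            result.add(new Tagged(words.get(i), tags.get(i)));
        }
        return result;
    }

    /** Demo with stub stages standing in for the real models. */
    public static List<Tagged> demo() {
        // Stub segmenter: ignores its input and returns a canned
        // three-word segmentation (wo / ai / Beijing), written as escapes.
        Function<String, List<String>> stubSegmenter =
                text -> Arrays.asList("\u6211", "\u7231", "\u5317\u4eac");

        // Stub tagger: looks each word up in a tiny hand-made table of
        // Penn Chinese Treebank style tags, defaulting to NN.
        Map<String, String> lexicon = new HashMap<>();
        lexicon.put("\u6211", "PN");
        lexicon.put("\u7231", "VV");
        lexicon.put("\u5317\u4eac", "NR");
        Function<List<String>, List<String>> stubTagger = words -> {
            List<String> tags = new ArrayList<>();
            for (String w : words) {
                tags.add(lexicon.getOrDefault(w, "NN"));
            }
            return tags;
        };

        return tagRawText("\u6211\u7231\u5317\u4eac", stubSegmenter, stubTagger);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Passing the stages in as functions keeps the hand-off explicit: whatever the segmenter returns is exactly what the tagger receives, which is the part that tokenizeText with a TokenizerFactory cannot do for the CRF segmenter.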

Or you might find it easier to use StanfordCoreNLP to connect the pieces together. It works with Chinese if you download the Chinese models jar available at http://nlp.stanford.edu/software/corenlp.shtml#History

Upvotes: 1
