Reputation: 75
I need to tag the words in Tweets, using Stanford POS Tagger.
As explained here [1], I used the MaxentTagger class and then called the method: maxtagger.tagString("This is a sample text");
This produces the output:
This_DT is_VBZ a_DT sample_NN text_NN
Now I have to create, for each tweet, a histogram of the occurrences of each tag in the tweet. I have searched the JavaDoc, but found nothing useful.
If I have to create the histogram myself, how can I read the output in a form other than a string (for example, as a list of the tags)?
Upvotes: 0
Views: 193
Reputation: 10905
I'd suggest using the method tagCoreLabels() or tagSentence() instead. E.g. with tagSentence() you get back a list of TaggedWord, from which you can easily access the POS tag using the tag() method. That should also cope with words, or models with POS tags, that contain "_".
To create a List<CoreLabel> from a plain sentence string, use the PTBTokenizer, e.g.
List<CoreLabel> tokens = new PTBTokenizer<CoreLabel>(
    new StringReader(s), new CoreLabelTokenFactory(), "invertible").tokenize();
Use the PTBEscapingProcessor to escape characters that have a special meaning in the parser models:
new PTBEscapingProcessor().apply(tokens)
I believe there is no specific support for histograms in the Stanford tools, but I may be wrong.
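If you end up building the histogram yourself, here is a minimal sketch using only the Java standard library. It counts tags directly from the string returned by tagString(), assuming the default "_" separator shown in the question. The class name TagHistogram and the method countTags are hypothetical names made up for this illustration, not part of the Stanford API. Splitting on the last "_" copes with words that contain "_", but not with tags that contain "_"; in that case reading tag() from the TaggedWord list is the safer route.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the Stanford tools.
class TagHistogram {

    // Counts how often each tag occurs in a string such as "This_DT is_VBZ".
    // Assumption: tokens are whitespace-separated and use "_" before the tag.
    static Map<String, Integer> countTags(String tagged) {
        Map<String, Integer> histogram = new TreeMap<>(); // sorted by tag name
        for (String token : tagged.trim().split("\\s+")) {
            int sep = token.lastIndexOf('_'); // last "_" so words may contain "_"
            if (sep < 0) {
                continue; // no separator found: skip (also covers empty input)
            }
            String tag = token.substring(sep + 1);
            histogram.merge(tag, 1, Integer::sum); // increment the count for this tag
        }
        return histogram;
    }

    public static void main(String[] args) {
        // Sample output taken from the question.
        System.out.println(countTags("This_DT is_VBZ a_DT sample_NN text_NN"));
        // prints {DT=2, NN=2, VBZ=1}
    }
}
```

The same merge() call works unchanged if you loop over a list of TaggedWord and pass each word's tag() as the key.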
Upvotes: 2