Francesco

Reputation: 75

POS Histogram with Stanford POS Tagger

I need to tag the words in Tweets, using Stanford POS Tagger.

As explained here [1], I used the MaxentTagger class and its tagString method: maxtagger.tagString("This is a sample text");

This produces the output:

This_DT is_VBZ a_DT sample_NN text_NN

Now I have to create, for each tweet, a histogram of how often each tag occurs in that tweet. I have searched the JavaDoc, but found nothing useful.

If I have to create the histogram myself, how can I get the output in a form other than a string (for example, as a list of the tags)?

Upvotes: 0

Views: 193

Answers (1)

rec

Reputation: 10905

I'd suggest using the method tagCoreLabels() or tagSentence() instead. With tagSentence(), for example, you get back a list of TaggedWord objects, from which you can easily read the POS tag using the tag() method. That should also cope with words, or models with POS tags, that contain "_", since you never have to split the tagged string yourself.

To create a List<CoreLabel> from a plain sentence string, use the PTBTokenizer, e.g.:

List<CoreLabel> tokens = new PTBTokenizer<CoreLabel>(
  new StringReader(s), new CoreLabelTokenFactory(), "invertible").tokenize();

Use the PTBEscapingProcessor to escape characters that have a special meaning in the parser models:

new PTBEscapingProcessor().apply(tokens);

I believe there is no specific support for histograms in the Stanford tools, but I may be wrong.
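If you do end up building the histogram yourself, a minimal sketch could look like the one below. It uses only the JDK and works on the tagString() output shown in the question. The class name TagHistogram and the method countTags are hypothetical names for illustration, not part of the Stanford tools, and the code assumes that the tag is whatever follows the last "_" in each token.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of the Stanford tools.
class TagHistogram {

    // Counts how often each POS tag occurs in tagString()-style output,
    // e.g. "This_DT is_VBZ a_DT sample_NN text_NN".
    static Map<String, Integer> countTags(String tagged) {
        Map<String, Integer> histogram = new TreeMap<>(); // sorted by tag name
        for (String token : tagged.trim().split("\\s+")) {
            // Assumption: the tag follows the LAST "_" in the token, so words
            // containing "_" still work; tags containing "_" would not.
            int sep = token.lastIndexOf('_');
            if (sep < 0 || sep == token.length() - 1) {
                continue; // no tag found on this token, skip it
            }
            histogram.merge(token.substring(sep + 1), 1, Integer::sum);
        }
        return histogram;
    }

    public static void main(String[] args) {
        System.out.println(countTags("This_DT is_VBZ a_DT sample_NN text_NN"));
        // prints {DT=2, NN=2, VBZ=1}
    }
}
```

If you go through tagSentence() instead, the same histogram.merge(...) call should work with the value of tag() from each TaggedWord, which avoids the string splitting and its "_" assumption altogether.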

Upvotes: 2
