Jake

Reputation: 4660

Stanford NLP: Keeping punctuation tokens?

I am working with sentences such as

Bachelors Degree in early childhood teaching, psychology

from which I want to extract

early childhood teaching

and

psychology

My code for this loops through the tokens of the triple's object and keeps the phrase only if every token meets certain POS requirements.

private void processTripleObject(List<CoreLabel> objectPhrase )
{
    try
    {
        StringBuilder sb = new StringBuilder();
        for(CoreLabel token: objectPhrase)
        {
            String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);

            TALog.getLogger().debug("pos: "+pos+"  word "+token.word());
            if(!matchDegreeNameByPos(pos))
            {
                return;
            }

            sb.append(token.word());
            sb.append(SPACE);
        }

        IdentifiedToken itoken = new IdentifiedToken(IdentifiedToken.SKILL, sb.toString());

    }
    catch(Exception e)
    {
        TALog.getLogger().error(e.getMessage(),e);
    }
}

Since the comma between teaching and psychology is not in the tokens, I don't know how to recognize the divide.

Can anyone advise?

Upvotes: 1

Views: 670

Answers (1)

Manos Nikolaidis

Reputation: 22234

Note that token.get(CoreAnnotations.PartOfSpeechAnnotation.class) will return the token itself if no POS tag was found. Tested with CoreNLP 3.7.0 and the "tokenize ssplit pos" annotators. You can then check whether pos appears in a String of the punctuation marks you are interested in. E.g., this is some code I just tested:

String punctuations = ".,;!?";
for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
    for (CoreLabel token: sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        // pos could be "NN" but could also be ","
        String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
        if (punctuations.contains(pos)) {
            // do something with it
        }
    }
}
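
Building on that check, here is a minimal sketch of how the object phrase from the question could be split at punctuation once those tokens are visible. It does not use CoreNLP itself: Token is a hypothetical stand-in for CoreLabel holding only a word and a POS tag, and PhraseSplitter, splitPhrases and SEPARATORS are names made up for illustration. The membership test is applied to the token text rather than the tag, on the assumption that the tag for some punctuation marks may not equal the character itself.

```java
import java.util.ArrayList;
import java.util.List;

public class PhraseSplitter {

    // Hypothetical stand-in for CoreLabel: only the word and its POS tag.
    public static final class Token {
        public final String word;
        public final String pos;

        public Token(String word, String pos) {
            this.word = word;
            this.pos = pos;
        }
    }

    // Punctuation that ends the current phrase (an assumption; adjust as needed).
    private static final String SEPARATORS = ",;";

    // Joins consecutive tokens with spaces and starts a new phrase at each separator.
    public static List<String> splitPhrases(List<Token> tokens) {
        List<String> phrases = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (Token token : tokens) {
            boolean isSeparator = token.word != null
                    && token.word.length() == 1
                    && SEPARATORS.contains(token.word);
            if (isSeparator) {
                // Close the phrase collected so far, if any.
                if (current.length() > 0) {
                    phrases.add(current.toString());
                    current.setLength(0);
                }
                continue;
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(token.word);
        }
        // Flush the last phrase, which has no trailing separator.
        if (current.length() > 0) {
            phrases.add(current.toString());
        }
        return phrases;
    }
}
```

With real CoreLabel objects the same loop would read token.word() instead of the field, and the POS filter from the question (matchDegreeNameByPos) could be applied to each non-separator token before appending it.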

Upvotes: 2
