Stanford NLP: Keeping punctuation tokens?

Question

I am looking for sentences such as

Bachelors Degree in early childhood teaching, psychology

I annotate the text using the Stanford Parser.
I then iterate over each sentence and identify "Bachelor's Degree" using NER (named entity recognition).
By processing the triples, I can see that the object follows "BE IN" and is likely to be a college major,
so I send the object phrase on for further analysis. My trouble is that I don't know how to separate

early childhood teaching

from

psychology

My code for this procedure loops through the tokens of the triple's object and keeps the phrase only if every token meets certain POS requirements.

private void processTripleObject(List<CoreLabel> objectPhrase)
{
    try
    {
        StringBuilder sb = new StringBuilder();
        for(CoreLabel token: objectPhrase)
        {
            String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);

            TALog.getLogger().debug("pos: "+pos+"  word "+token.word());
            if(!matchDegreeNameByPos(pos))
            {
                return;
            }

            sb.append(token.word());
            sb.append(SPACE);
        }

        IdentifiedToken itoken = new IdentifiedToken(IdentifiedToken.SKILL, sb.toString());

    }
    catch(Exception e)
    {
        TALog.getLogger().error(e.getMessage(),e);
    }
}

Since the comma between teaching and psychology is not in the tokens, I don't know how to recognize the divide.

Can anyone advise?

Manos Nikolaidis · Accepted Answer

Note that for a punctuation token such as a comma, token.get(CoreAnnotations.PartOfSpeechAnnotation.class) returns the punctuation mark itself (",") rather than a word-class tag such as "NN". Tested with CoreNLP 3.7.0 and the "tokenize ssplit pos" annotators. You can therefore check whether pos occurs in a String of the punctuation marks you are interested in. For example, this is some code I just tested:

String punctuations = ".,;!?";
for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
    for (CoreLabel token: sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        // pos could be "NN" but could also be ","
        String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
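        // guard against a missing or empty tag before the substring check
        if (pos != null && !pos.isEmpty() && punctuations.contains(pos)) {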
            // do something with it
        }
    }
}
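
Building on that check, here is a minimal sketch of how the object phrase from the question could be split at comma tokens. To keep it runnable without CoreNLP on the classpath, it uses a hypothetical Tok class in place of CoreLabel, and the names PhraseSplitter and split are made up for illustration; with real CoreNLP the word would come from token.word() and the tag from PartOfSpeechAnnotation. It assumes the comma token is present in the list and carries the tag ",".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for CoreLabel: only the word and its POS tag.
final class Tok {
    final String word;
    final String pos;

    Tok(String word, String pos) {
        this.word = word;
        this.pos = pos;
    }
}

final class PhraseSplitter {
    // Assumption: a comma token is tagged "," and marks a phrase boundary.
    private static final String COMMA_TAG = ",";

    // Splits the token sequence into space-joined phrases at each comma token.
    static List<String> split(List<Tok> tokens) {
        List<String> phrases = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (Tok t : tokens) {
            if (COMMA_TAG.equals(t.pos)) {
                flush(current, phrases);   // boundary: close the current phrase
            } else {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(t.word);
            }
        }
        flush(current, phrases);           // close the trailing phrase
        return phrases;
    }

    // Adds the buffered phrase to the result if it is non-empty, then resets the buffer.
    private static void flush(StringBuilder current, List<String> phrases) {
        if (current.length() > 0) {
            phrases.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        List<Tok> objectPhrase = Arrays.asList(
                new Tok("early", "JJ"), new Tok("childhood", "NN"),
                new Tok("teaching", "NN"), new Tok(",", ","),
                new Tok("psychology", "NN"));
        for (String phrase : split(objectPhrase)) {
            System.out.println(phrase);
        }
    }
}
```

Each returned phrase could then be passed through the POS filter from the question separately, so that a rejected token in one major does not discard the others.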
