maga

Reputation: 720

Separately tokenizing and pos-tagging with CoreNLP

I'm having a few problems with the way Stanford CoreNLP divides text into sentences, namely:

  1. It treats ! and ? (exclamation and question marks) inside quoted text as a sentence end when it shouldn't, e.g.: He shouted "Alice! Alice!" Here it treats the ! after the first Alice as a sentence end and splits the text into two sentences.
  2. It doesn't recognize ellipses as a sentence end.

In NLTK we would deal with these problems by simply normalizing the text around the sentence-splitting step, that is, replacing the marks in question with other symbols before splitting and restoring them afterwards, so that the text goes down the pipeline in its proper form.
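To make the normalization idea concrete, here is a minimal sketch of what I mean. The placeholder characters, the function names, and the regex-based `naive_split` are all made up for illustration; `naive_split` only stands in for a real sentence splitter (NLTK's `sent_tokenize` could be passed as `splitter` instead), and the sketch assumes the placeholders never occur in the input.

```python
import re

# Hypothetical placeholders (private-use code points), assumed absent from the input.
PLACEHOLDERS = {"!": "\ue000", "?": "\ue001"}
RESTORE = {placeholder: mark for mark, placeholder in PLACEHOLDERS.items()}

# A double-quoted span without nested quotes.
QUOTED = re.compile(r'"[^"]*"')


def mask_quoted_marks(text):
    """Replace ! and ? inside double-quoted spans with placeholders."""
    def mask(match):
        span = match.group(0)
        for mark, placeholder in PLACEHOLDERS.items():
            span = span.replace(mark, placeholder)
        return span
    return QUOTED.sub(mask, text)


def unmask(text):
    """Put the original marks back in place of the placeholders."""
    for placeholder, mark in RESTORE.items():
        text = text.replace(placeholder, mark)
    return text


def naive_split(text):
    """Stand-in splitter: break after ., ! or ? when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def split_sentences(text, splitter=naive_split):
    """Mask quoted marks, split into sentences, then restore the marks."""
    return [unmask(s) for s in splitter(mask_quoted_marks(text))]


text = 'He shouted "Alice! Alice!" and ran. Nobody answered.'
sentences = split_sentences(text)
```

With the masking step, the quoted exclamation marks no longer trigger a split; calling `naive_split(text)` directly breaks the first sentence after the first "Alice!".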

However, the tokenizer in CoreNLP tokenizes before dividing into sentences and that doesn't leave much room to tweak the process. So, my first question: is it possible to "correct" the tokenizer without rewriting it to account for such cases?

If it's not, can we at least separate tokenization from the rest of the pipeline (in my case it's pos, lemma, and parse), so that we can change the tokens themselves before sending them further down?

Thanks!

Upvotes: 3

Views: 1232

Answers (1)

yvyas

Reputation: 64

It seems to me that you would be better off separating the tokenization phase from your other downstream tasks (so I'm basically answering Question 2). You have two options:

  1. Tokenize using the Stanford tokenizer (the example below is from the Stanford CoreNLP usage page). In your case, the -annotators option should list only tokenize.

    java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file input.txt
    

    Once you do this, you can tell the other modules not to tokenize your input. For example, the Stanford Parser has a command-line flag (-tokenized) which you can set to indicate that your input is already tokenized.

  2. Use a different tokenizer (say, NLTK) to tokenize, and follow the second part of 1.

In fact, if you use any external tool to split text into sentences (basically chunks that you don't want to split any further), you can set a command-line flag in the CoreNLP tools that stops them from splitting your input again. For the Stanford Parser, this is done with the "-sentences newline" flag. This is probably the easiest thing to do, provided you have a reliable sentence detector.
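As a rough sketch of the hand-off, the snippet below turns sentences that were already split and tokenized elsewhere into text with one sentence per line and tokens separated by single spaces. The function name `to_parser_input` and the sample tokens are hypothetical, and I am assuming that whitespace-delimited tokens are what the parser expects when -tokenized is set, so check that against the parser documentation for your version.

```python
def to_parser_input(sentences):
    """Render pre-tokenized sentences as one sentence per line, tokens space-separated.

    `sentences` is a list of token lists produced by whatever external
    tokenizer and sentence splitter you trust.
    """
    lines = []
    for tokens in sentences:
        # Drop empty tokens and surrounding whitespace.
        cleaned = [tok.strip() for tok in tokens if tok.strip()]
        for tok in cleaned:
            # A token with inner whitespace would be read as several tokens.
            if any(ch.isspace() for ch in tok):
                raise ValueError(f"token contains whitespace: {tok!r}")
        if cleaned:
            lines.append(" ".join(cleaned))
    return "".join(line + "\n" for line in lines)


# Sample input, tokenized by hand for illustration.
example = [
    ["He", "shouted", '"', "Alice", "!", "Alice", "!", '"'],
    ["Nobody", "answered", "."],
]
parser_input = to_parser_input(example)
```

Writing `parser_input` to a file and passing that file to the parser together with -tokenized and "-sentences newline" should keep each line as a single sentence, whatever punctuation it contains.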

Upvotes: 3
