Reputation: 4538
I just used OpenNLP for a small program where I was supposed to segment a paragraph into sentences. Although I was able to complete the task after reading some documentation and going through their test cases, I couldn't help but notice that I still have to train for every abbreviation (for example, Yahoo!), even though I created a custom abbreviation dictionary, passed it to SentenceDetectorFactory, and used it to train SentenceDetectorME. I am using a similar approach to the one used in this test case.
I couldn't find this behaviour described in their documentation, nor could I find any explanation. Is there something I am missing?
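For reference, here is roughly what I am doing (a trimmed sketch, not my exact code; sentences.train and the Yahoo! entry are placeholders, and the API is the OpenNLP 1.6-style one used in the test case):

```java
import java.io.File;
import java.nio.charset.StandardCharsets;

import opennlp.tools.dictionary.Dictionary;
import opennlp.tools.sentdetect.SentenceDetectorFactory;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.sentdetect.SentenceSampleStream;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.StringList;
import opennlp.tools.util.TrainingParameters;

public class SentenceTrainingSketch {

    public static void main(String[] args) throws Exception {
        // Custom abbreviation dictionary with the tokens that should
        // not be treated as sentence ends.
        Dictionary abbreviations = new Dictionary();
        abbreviations.put(new StringList("Yahoo!"));

        // Pass the dictionary to the factory...
        SentenceDetectorFactory factory =
                new SentenceDetectorFactory("en", true, abbreviations, null);

        // ...and use the factory to train SentenceDetectorME.
        // "sentences.train" holds one training sentence per line.
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("sentences.train")),
                StandardCharsets.UTF_8);
        ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines);

        SentenceModel model = SentenceDetectorME.train(
                "en", samples, factory, TrainingParameters.defaultParams());

        SentenceDetectorME detector = new SentenceDetectorME(model);
        for (String s : detector.sentDetect(
                "I looked it up on Yahoo! yesterday. Nothing came up.")) {
            System.out.println(s);
        }
    }
}
```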
Edit: Explanation of my problem
Although I am still working on building a training set suited to the domain I am working in, my test data comes from unstructured data from the web. Sometimes it contains an abbreviation that none of my team members ever anticipated, e.g.
Company (acq. by another company) is a good company.
In this case we never expected the word acquired to occur as acq., which is clearly being used as an abbreviation. Now we can either add acq. to the abbreviation dictionary and let the model keep working, as advertised, or train the model on it. But even after adding it to the abbreviation dictionary, it was not treated as an abbreviation, and we ended up training the model on it. This seems like a deviation from the concept of an abbreviation dictionary.
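The check that fails looks roughly like this (a self-contained sketch; en-sent-custom.bin is a placeholder for the model retrained with acq. in the dictionary):

```java
import java.io.File;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class AbbreviationCheck {

    public static void main(String[] args) throws Exception {
        // Placeholder: model retrained with "acq." in the abbreviation dictionary.
        SentenceModel model = new SentenceModel(new File("en-sent-custom.bin"));
        SentenceDetectorME detector = new SentenceDetectorME(model);

        String[] sentences = detector.sentDetect(
                "Company (acq. by another company) is a good company.");

        // Expected: one sentence. Observed: a spurious split after "acq."
        for (String s : sentences) {
            System.out.println(s);
        }
    }
}
```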
I tried a small example in NLTK with PunktSentenceTokenizer, like this one, and it works perfectly.
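For comparison, the NLTK version behaves as I would expect (a minimal sketch; here the abbreviation set is supplied by hand rather than learned):

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Punkt stores abbreviations lowercased and without the trailing period.
punkt_params = PunktParameters()
punkt_params.abbrev_types = {'acq'}

tokenizer = PunktSentenceTokenizer(punkt_params)
print(tokenizer.tokenize("Company (acq. by another company) is a good company."))
# -> ['Company (acq. by another company) is a good company.']
```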
I am not sure that a training set of even 25,000 sentences will make a difference if OpenNLP is ignoring the abbreviation dictionary.
Upvotes: 1
Views: 426
Reputation:
How large is your training data?
As the documentation says:
The training data should contain at least 15000 sentences to create a model which performs well.
That might be the problem: you should train on a larger data set to build a model that performs well.
Upvotes: 2