Reputation: 4538
I just used OpenNLP for a small program where I was supposed to segment a paragraph into sentences. Although I was able to complete the task after reading some documentation and going through their test cases, I couldn't help but notice that I still have to train for every abbreviation (for example, Yahoo!), even though I created a custom abbreviation dictionary, passed it to SentenceDetectorFactory, and used it to train SentenceDetectorME. I am using a similar approach to the one used in this test case.
I couldn't find this behaviour described in their documentation, nor could I find any explanation. Is there something I am missing?
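For reference, here is roughly what I am doing (a trimmed sketch, not my exact code; sentences.train and the Yahoo! entry are placeholders, and the API is the OpenNLP 1.6-style one used in the test case):

```java
import java.io.File;
import java.nio.charset.StandardCharsets;

import opennlp.tools.dictionary.Dictionary;
import opennlp.tools.sentdetect.SentenceDetectorFactory;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.sentdetect.SentenceSampleStream;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.StringList;
import opennlp.tools.util.TrainingParameters;

public class SentenceTrainingSketch {

    public static void main(String[] args) throws Exception {
        // Custom abbreviation dictionary with the tokens that should
        // not be treated as sentence ends.
        Dictionary abbreviations = new Dictionary();
        abbreviations.put(new StringList("Yahoo!"));

        // Pass the dictionary to the factory...
        SentenceDetectorFactory factory =
                new SentenceDetectorFactory("en", true, abbreviations, null);

        // ...and use the factory to train SentenceDetectorME.
        // "sentences.train" holds one training sentence per line.
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("sentences.train")),
                StandardCharsets.UTF_8);
        ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines);

        SentenceModel model = SentenceDetectorME.train(
                "en", samples, factory, TrainingParameters.defaultParams());

        SentenceDetectorME detector = new SentenceDetectorME(model);
        for (String s : detector.sentDetect(
                "I looked it up on Yahoo! yesterday. Nothing came up.")) {
            System.out.println(s);
        }
    }
}
```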
Edit: Explanation of my problem
Although I am still working on building a training set suited to the domain I am working in, my test data comes from unstructured data from the web. Sometimes it contains an abbreviation that none of my team members ever anticipated, e.g.
Company (acq. by another company) is a good company.
In this case we never expected the word acquired to occur as acq., which is clearly being used as an abbreviation. Now we can either add acq. to the abbreviation dictionary and let the model keep working, as advertised, or train the model on it. But even after adding it to the abbreviation dictionary, it was not treated as an abbreviation, and we ended up training the model on it. This seems like a deviation from the concept of an abbreviation dictionary.
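The check that fails looks roughly like this (a self-contained sketch; en-sent-custom.bin is a placeholder for the model retrained with acq. in the dictionary):

```java
import java.io.File;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class AbbreviationCheck {

    public static void main(String[] args) throws Exception {
        // Placeholder: model retrained with "acq." in the abbreviation dictionary.
        SentenceModel model = new SentenceModel(new File("en-sent-custom.bin"));
        SentenceDetectorME detector = new SentenceDetectorME(model);

        String[] sentences = detector.sentDetect(
                "Company (acq. by another company) is a good company.");

        // Expected: one sentence. Observed: a spurious split after "acq."
        for (String s : sentences) {
            System.out.println(s);
        }
    }
}
```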
I tried a small example in NLTK with PunktSentenceTokenizer, like this one, and it works perfectly.
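For comparison, the NLTK version behaves as I would expect (a minimal sketch; here the abbreviation set is supplied by hand rather than learned):

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Punkt stores abbreviations lowercased and without the trailing period.
punkt_params = PunktParameters()
punkt_params.abbrev_types = {'acq'}

tokenizer = PunktSentenceTokenizer(punkt_params)
print(tokenizer.tokenize("Company (acq. by another company) is a good company."))
# -> ['Company (acq. by another company) is a good company.']
```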
I am not sure that a training set of even 25,000 sentences will make a difference if OpenNLP is ignoring the abbreviation dictionary.
Upvotes: 1
Views: 426
Reputation:
How large is your training data?
As the documentation says:
The training data should contain at least 15000 sentences to create a model which performs well.
That might be the problem: you should train on a larger data set to build a model that performs well.
Upvotes: 2