Reputation: 693
I'm trying out the OpenNLP sentence detection tool. The text is in a file, para3.txt. Contents:
Bob went to London Mary came from Paris Now everything is fine.
I'm running it with the following command:
opennlp SentenceDetector ../models/en-sent.bin < para3.txt
I get output like this:
Bob went to London Mary came from Paris Now everything is fine.
Ideally, I would see three sentences as output:
Bob went to London.
Mary came from Paris.
Now everything is fine.
If I try other text where a full stop (period) is present, sentence detection works fine. A human would guess that there are 3 sentences in this text, but how can I get OpenNLP to do the same? Which NLP tools could help here? What is the next level of sentence detection?
Upvotes: 2
Views: 1400
Reputation:
You should train your own model to detect this type of sentence, using the sentence detector training described in the documentation. Create your training file, en-sent.train. The only requirement is that each sentence is on a separate line in the training file, like the sample below:
Sentence 1
Sentence 2
Sentence 3
……
……
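As a rough sketch of preparing that file, the Python helper below writes a list of sentences to en-sent.train, one per line. The function name write_training_file and the sample sentences are made up for illustration; only the one-sentence-per-line layout comes from the description above.

```python
from pathlib import Path


def write_training_file(sentences, path="en-sent.train"):
    """Write sentences to a training file, one sentence per line.

    Hypothetical helper: internal whitespace (including stray newlines)
    is collapsed so that each sentence occupies exactly one line, and
    blank entries are skipped.
    """
    lines = [" ".join(s.split()) for s in sentences if s.strip()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


if __name__ == "__main__":
    # Sample sentences, assumed for illustration only.
    count = write_training_file([
        "Bob went to London.",
        "Mary came from Paris.",
        "Now everything is fine.",
    ])
    print(f"wrote {count} sentences to en-sent.train")
```

A real training set would need far more sentences than this; the three lines here only show the file layout.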
Then, using the command-line interface:
opennlp SentenceDetectorTrainer -model en-sent_trained.bin -lang en -data en-sent.train -encoding UTF-8
This will produce a model file: en-sent_trained.bin
Now use this .bin file instead of en-sent.bin.
Hope this helps!
Upvotes: 2
Reputation: 656
This actually seems to be malformed text. You can use chunking information to divide it into sentences using some heuristics.
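To give a flavour of what such a heuristic might look like, here is a minimal Python sketch. Note that it does not use chunker output at all: as a simpler stand-in, it assumes a sentence boundary wherever one capitalized word directly follows another. The function name split_on_adjacent_capitals is made up for this example, and the rule will wrongly split multi-word names such as "New York", so treat it as a starting point rather than a solution.

```python
def split_on_adjacent_capitals(text):
    """Split text into sentences using a naive capitalization heuristic.

    Assumption (not from OpenNLP): a capitalized token that directly
    follows another capitalized token starts a new sentence.
    """
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        prev = tokens[i - 1] if i > 0 else ""
        # Boundary: previous token and this token both start uppercase.
        if current and tok[:1].isupper() and prev[:1].isupper():
            sentences.append(" ".join(current))
            current = []
        current.append(tok)
    if current:
        sentences.append(" ".join(current))
    # Add a full stop to any sentence that lacks terminal punctuation.
    return [s if s.endswith((".", "!", "?")) else s + "." for s in sentences]


if __name__ == "__main__":
    text = "Bob went to London Mary came from Paris Now everything is fine."
    for sentence in split_on_adjacent_capitals(text):
        print(sentence)
```

On the text from the question this prints the three expected sentences. A chunker or POS tagger could replace the capitalization test with something more robust, for example starting a new sentence when a noun phrase is immediately followed by another noun phrase.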
Upvotes: 0