Lorderon

Reputation: 171

OpenNLP SentenceDetector doesn't recognize whole sentence

I'm working on a research project and need an NLP tool that detects sentences in many different circumstances. I was advised to use OpenNLP, and after reading its wiki pages I am convinced it is the right choice. I use OpenNLP to detect sentences as well as any words or phrases that do not belong to a sentence (also called sentence fragments).

OpenNLP accepts .txt files as input if you redirect the input. To use a .doc file as input, you first have to convert it to a .txt file. This is where my problem starts.

I have many files in different formats, and I would like to detect sentences in every file that contains text. I therefore started converting each file that might contain text to a .txt file. The conversion process is not perfect. For example, if a sentence is too long (say, longer than one line), the conversion tool writes it out as two separate lines. As a result, OpenNLP reports each line as a separate sentence because of the end-of-line character at the end of the first line.

My question is: is there any way to parameterize or configure OpenNLP so that it recognizes the whole sentence (the first and second lines together)?

Upvotes: 1

Views: 879

Answers (2)

user4894151

Reputation:

I suggest using Apache Tika to convert the different files. Apache Tika has an AutoDetectParser that detects the file type and extracts the text in it (and the metadata too, if you want), which you can then save to a .txt file.

Upvotes: 1

Daniel

Reputation: 6039

Try your paragraph, with the newlines replaced by spaces, in CoreNLP: nlp.stanford.edu:8080/corenlp/process
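
Replacing the newlines with spaces can be done as a preprocessing step before the text reaches any sentence detector. Below is a minimal sketch in plain Java, under the assumption that a blank line in the converted .txt file marks a real paragraph boundary and that a single line break inside a paragraph is only a wrap introduced by the conversion. The class name `LineUnwrapper` and the method `unwrap` are made up for this illustration; they are not part of OpenNLP or CoreNLP.

```java
public class LineUnwrapper {

    /**
     * Joins soft-wrapped lines inside each paragraph with a single space.
     * Assumption: paragraphs are separated by at least one blank line.
     */
    public static String unwrap(String text) {
        // Normalize Windows and old Mac line endings to "\n" first.
        String normalized = text.replace("\r\n", "\n").replace('\r', '\n').trim();

        // A paragraph break is a newline, optional blanks, another newline,
        // followed by any further whitespace-only lines.
        String[] paragraphs = normalized.split("\\n[ \\t]*\\n[ \\t\\n]*");

        StringBuilder out = new StringBuilder();
        for (String paragraph : paragraphs) {
            if (out.length() > 0) {
                out.append("\n\n"); // keep paragraph boundaries intact
            }
            // Inside a paragraph, a line break (plus surrounding blanks)
            // becomes one space, so a wrapped sentence is on one line again.
            out.append(paragraph.replaceAll("[ \\t]*\\n[ \\t]*", " "));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String converted = "This is a long sentence that\n"
                + "wraps onto a second line.\n"
                + "\n"
                + "A new paragraph starts here.";
        System.out.println(unwrap(converted));
    }
}
```

The unwrapped string can then be passed to the sentence detector of your choice (for example OpenNLP's SentenceDetectorME) instead of the raw converted text. Note that this sketch does not rejoin words hyphenated across a line break, and it will merge headings or list items into the following line if they are not separated by a blank line.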

Upvotes: 0
