s59494

OpenNLP splits our sentences in half along special characters

We're facing an issue while processing text extracted from PDF documents (whose content we have no control over). Most of our text data contains sections that pose a challenge for OpenNLP, which we use to detect sentences for further processing. We are using the en-sent.bin model file from the OpenNLP website.

For example, the texts often contain GPS coordinates like 40° 43.554’ N, 73° 59.814’ W, and OpenNLP appears to treat anything after a ’ character as the start of a new sentence. This splits some of our sentences in unwanted places, and we'd like to find a solution or workaround.

The character in question turns out not to be a regular single quote (U+0027) but the one called 'RIGHT SINGLE QUOTATION MARK' (U+2019, encoded in UTF-8 as 0xE2 0x80 0x99). It looks like the sentence that contains the coordinates is split exactly at these characters.
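
For anyone who wants to check their own input the same way, the character can be identified with a few lines of plain Java. This is a minimal sketch and not our production code; the class name `CodePointDump` and the method `describe` are made up for illustration, and the sample string is just the coordinate example from above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: lists every non-ASCII code point
// in a string together with its Unicode name and its UTF-8 byte sequence.
class CodePointDump {

    static List<String> describe(String text) {
        List<String> lines = new ArrayList<>();
        text.codePoints()
            .filter(cp -> cp > 0x7F)               // skip plain ASCII
            .forEach(cp -> {
                String ch = new String(Character.toChars(cp));
                StringBuilder hex = new StringBuilder();
                for (byte b : ch.getBytes(StandardCharsets.UTF_8)) {
                    hex.append(String.format(" 0x%02X", b & 0xFF));
                }
                lines.add(String.format("U+%04X %s:%s",
                        cp, Character.getName(cp), hex));
            });
        return lines;
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file independent of its own encoding.
        String sample = "40\u00B0 43.554\u2019 N, 73\u00B0 59.814\u2019 W";
        describe(sample).forEach(System.out::println);
    }
}
```

Running this over a suspicious passage shows at a glance which non-ASCII characters sit at the positions where the splits occur.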

We don't know how the en-sent.bin Sentence Detector model was trained or what character encoding it expects (our input is UTF-8). We found no such information in the OpenNLP documentation, even though it mentions that the character encoding is specified when the model is trained.

We dismissed filtering out such characters (i.e. all of those at which the splits happen) as a solution, since we can't know for sure which characters are affected, and removing them might introduce the very similar problem of accidentally joining two sentences.
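
To make the idea concrete, here is a sketch of a gentler variant we have not verified against en-sent.bin: substitute the character one-for-one instead of deleting it, so that character offsets stay valid and the sentences can be cut from the untouched original. Whether the model treats U+0027 any differently from U+2019 is purely an assumption. The class `OffsetPreservingMask` and its methods are hypothetical, and the spans are hard-coded here; in real use they would come from something like the start and end offsets of the `Span` objects returned by `SentenceDetectorME.sentPosDetect`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP.
class OffsetPreservingMask {

    // Replace U+2019 with U+0027. Both are a single UTF-16 char,
    // so every offset in the masked text is valid in the original too.
    static String mask(String text) {
        return text.replace('\u2019', '\'');
    }

    // Cut sentences out of the ORIGINAL text using [start, end) offsets
    // that were computed on the masked text.
    static List<String> sliceOriginal(String original, int[][] spans) {
        List<String> sentences = new ArrayList<>();
        for (int[] span : spans) {
            sentences.add(original.substring(span[0], span[1]));
        }
        return sentences;
    }

    public static void main(String[] args) {
        String original = "Meet at 40\u00B0 43.554\u2019 N. Bring water.";
        String masked = mask(original);   // this is what the detector would see
        System.out.println(masked.length() == original.length());

        // Stand-in for detector output: two spans split at the real boundary.
        int second = original.indexOf("Bring");
        int[][] spans = { {0, second - 1}, {second, original.length()} };
        sliceOriginal(original, spans).forEach(System.out::println);
    }
}
```

This keeps the original characters in the output, but it still does not solve the problem of not knowing which characters need masking in the first place.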

Since our team has very little experience with OpenNLP, we're struggling to fix this. So far we have identified what we believe to be two candidate causes for the unwanted split, which I'd rather not post unless absolutely necessary, so as not to bias your thinking.

Please note that I'm obliged not to include our source code or the exact data we're feeding in, as those are highly confidential and the latter may contain personal or otherwise protected information.
