Reputation: 9
Please give me directions: how can I "teach" the sentence splitter to split a paragraph such as this one:
The paper is 7 cm. length. What is the painter name? the size of the picture is 5 cm. x 8 cm.
into 3 parts, and not into the 5 parts it produces by default:
1) The paper is 7 cm.
2) length.
3) What is the painter name?
4) the size of the picture is 5 cm.
5) x 8 cm.
Thanks, Aryeh.
Upvotes: 0
Views: 99
Reputation: 1563
The tokenizer is entirely rule-based, so you can add custom abbreviations (such as "cm.") to it. To do so, you will have to edit PTBLexer.flex, the JFlex grammar that the tokenizer is generated from, and regenerate the lexer with JFlex.
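To illustrate why an abbreviation list changes the result, here is a minimal standalone sketch of the idea. It is not CoreNLP code and does not touch PTBLexer.flex; the class name AbbreviationAwareSplitter and its split method are hypothetical, and the rule (a token ending in ".", "?" or "!" closes a sentence unless it is a known abbreviation) is a deliberate simplification of what a real tokenizer does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: NOT the CoreNLP tokenizer or sentence splitter.
public class AbbreviationAwareSplitter {

    // Splits on whitespace tokens; a token ending in '.', '?' or '!' closes the
    // current sentence unless its lowercase form is in the abbreviation set.
    static List<String> split(String text, Set<String> abbreviations) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields a single empty token
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(token);
            char last = token.charAt(token.length() - 1);
            boolean terminator = last == '.' || last == '?' || last == '!';
            if (terminator && !abbreviations.contains(token.toLowerCase())) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        // Flush trailing text, e.g. when the paragraph ends with an abbreviation.
        if (current.length() > 0) {
            sentences.add(current.toString());
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "The paper is 7 cm. length. What is the painter name? "
                + "the size of the picture is 5 cm. x 8 cm.";
        // No abbreviations known: every "cm." ends a sentence.
        System.out.println(split(text, Set.of()).size());
        // "cm." registered as an abbreviation: it no longer ends a sentence.
        for (String sentence : split(text, Set.of("cm."))) {
            System.out.println(sentence);
        }
    }
}
```

With an empty abbreviation set this sketch reproduces the 5-way split from the question; with "cm." registered it yields the 3 parts asked for. In CoreNLP itself the equivalent change is made in the abbreviation rules of PTBLexer.flex, as described above.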
See also "stanford corenlp, splitting sentences, abbreviation exceptions".
Upvotes: 1