Reputation: 163
I need to use OpenNLP Parser for a specific task. The documentation suggests that you send it tokenized input, which implies that no further tokenization will take place. However, when I pass a string with parentheses, brackets, or braces, OpenNLP tokenizes them and converts them to PTB tokens.
I don't want this to happen, but I can't figure out how to prevent it.
Specifically, if my input contains "{2}", I want it to stay that way, not become "-LCB- 2 -RCB-". I now have 3 tokens where I once had one. I'd also strongly prefer not to have to post-process the output to undo the PTB tokens.
Is there a way to prevent OpenNLP Parser from tokenizing?
Upvotes: 1
Views: 400
Reputation: 1281
Looking at the Javadocs, there are two parseLine methods, and one of them accepts a tokenizer. I haven't tried the following, but I guess you could train your own tokenizer (https://opennlp.apache.org/docs/1.8.0/manual/opennlp.html#tools.tokenizer.training), which shouldn't be that much of a problem, or fall back to simple whitespace splitting if need be, and pass that to the parseLine method (in addition to the sentence, the parser, and the number of desired parses). Something like the following:
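import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.cmdline.parser.ParserTool;
import opennlp.tools.parser.*;
import opennlp.tools.tokenize.SimpleTokenizer;

public static void main(String[] args) throws Exception {
    // Load the parser model and close the stream once the parse is done.
    try (InputStream inputStream = new FileInputStream(FileFactory.generateOrCreateFileInstance(<location to en-parser-chunking.bin>))) {
        ParserModel model = new ParserModel(inputStream);
        Parser parser = ParserFactory.create(model);
        String sentence = "An example with a {2} string.";
        // The three-argument overload takes no tokenizer:
        // Parse[] topParses = ParserTool.parseLine(sentence, parser, 1);
        // Instead of using the line above, pass a tokenizer explicitly:
        Parse[] topParses = ParserTool.parseLine(sentence, parser, SimpleTokenizer.INSTANCE, 1);
        for (Parse p : topParses) {
            p.show();
        }
    }
}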
This particular piece of code still splits the { from the 2 in the input, resulting in:
(TOP (NP (NP (DT An) (NN example)) (PP (IN with) (NP (DT a) (-LRB- -LCB-) (CD 2) (-RRB- -RCB-) (NN string))) (. .)))
However, if you train your own tokenizer so that it does not split the cases you want to keep as a single token, I guess this should work.
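For the whitespace-splitting fallback mentioned above, here is a minimal sketch in plain Java with no OpenNLP dependency. The class name WhitespaceSplitSketch and its tokenize method are made up for illustration and are not part of OpenNLP. It only shows what whitespace-only splitting does to the example sentence; it is untested against parseLine, which may still separate braces on its own.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not an OpenNLP class: illustrates whitespace-only splitting.
public class WhitespaceSplitSketch {

    // Splits on runs of whitespace only, so brackets and punctuation
    // stay attached to the neighbouring characters.
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String piece : sentence.trim().split("\\s+")) {
            if (!piece.isEmpty()) { // an empty input yields one empty piece; skip it
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [An, example, with, a, {2}, string.]
        System.out.println(tokenize("An example with a {2} string."));
    }
}
```

Note the trade-off: "{2}" survives as one token, but the final period also stays glued to "string.", so the input would need to be pre-tokenized with spaces wherever a split is actually wanted.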
Upvotes: 1