Reputation: 1436
I'm thinking of leveraging Lucene's StandardTokenizer for word tokenization in a non-IR context.
I understand that this tokenizer removes punctuation characters. Would anybody know (or happen to have experience with) making it output punctuation characters as separate tokens?
Example of current behaviour:
Welcome, Dr. Chasuble! => Welcome Dr. Chasuble
Example of desired behaviour:
Welcome, Dr. Chasuble! => Welcome , Dr. Chasuble !
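For reference, a minimal sketch that reproduces the current behaviour (assuming a recent Lucene release where StandardTokenizer has a no-argument constructor; older versions take a Reader and/or Version in the constructor):

    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    import java.io.StringReader;

    public class StandardTokenizerDemo {
        public static void main(String[] args) throws Exception {
            StandardTokenizer tokenizer = new StandardTokenizer();
            tokenizer.setReader(new StringReader("Welcome, Dr. Chasuble!"));
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                // Punctuation-only tokens such as "," and "!" never show up here.
                System.out.println(term.toString());
            }
            tokenizer.end();
            tokenizer.close();
        }
    }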
Upvotes: 1
Views: 1119
Reputation: 96
You could consider using a tokenization tool from the NLP community instead; issues like this are usually already well taken care of there.
Some off-the-shelf options are Stanford CoreNLP (which also provides its tokenizer as an individual component) and UIUC's pipeline, which should handle it elegantly as well: http://cogcomp.cs.illinois.edu/page/software/
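For example, a rough sketch using CoreNLP's PTBTokenizer (class names and the exact output are from my reading of the CoreNLP API, so verify against the version you use):

    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.process.CoreLabelTokenFactory;
    import edu.stanford.nlp.process.PTBTokenizer;

    import java.io.StringReader;

    public class CoreNlpTokenizerSketch {
        public static void main(String[] args) {
            // PTBTokenizer emits punctuation as separate tokens and keeps
            // abbreviation periods (e.g. "Dr.") attached to the word.
            PTBTokenizer<CoreLabel> tokenizer = new PTBTokenizer<>(
                    new StringReader("Welcome, Dr. Chasuble!"),
                    new CoreLabelTokenFactory(),
                    "");
            while (tokenizer.hasNext()) {
                System.out.println(tokenizer.next().word());
            }
            // Should print something like: Welcome , Dr. Chasuble !
        }
    }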
Upvotes: 2
Reputation: 708
Generally, for custom tokenization of both IR and non-IR content it is a good idea to use ICU (ICU4J is the Java version). This would be a good place to start: http://userguide.icu-project.org/boundaryanalysis
The tricky part is preserving the period as part of "Dr.". You would have to use the dictionary-based iterator, or else implement your own heuristic, either in your code or by creating your own iterator, which in ICU can be defined as a file containing a number of regexp-style rules.
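As a starting point, here is a rough sketch with ICU4J's BreakIterator; note that the default word rules still split "Dr." into "Dr" and ".", which is exactly the part that needs the dictionary-based or custom rules mentioned above:

    import com.ibm.icu.text.BreakIterator;

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;

    public class IcuTokenizerSketch {
        public static List<String> tokenize(String text) {
            // Word-boundary iterator: punctuation comes out as its own segments.
            BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
            it.setText(text);
            List<String> tokens = new ArrayList<>();
            int start = it.first();
            for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
                String token = text.substring(start, end);
                if (!token.trim().isEmpty()) {   // drop whitespace-only segments
                    tokens.add(token);
                }
            }
            return tokens;
        }

        public static void main(String[] args) {
            // Prints [Welcome, ,, Dr, ., Chasuble, !] with the default rules.
            System.out.println(tokenize("Welcome, Dr. Chasuble!"));
        }
    }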
Upvotes: 2