sam

Reputation: 1436

Preserve punctuation characters when using Lucene's StandardTokenizer

I'm thinking of leveraging Lucene's StandardTokenizer for word tokenization in a non-IR context.

I understand that this tokenizer removes punctuation characters. Would anybody know (or happen to have experience with) making it output punctuation characters as separate tokens?

Example of current behaviour:

Welcome, Dr. Chasuble! => Welcome Dr. Chasuble

Example of desired behaviour:

Welcome, Dr. Chasuble! => Welcome , Dr. Chasuble !
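
For reference, this is roughly how I'm driving the tokenizer now (a minimal sketch, assuming a recent Lucene version where StandardTokenizer has a no-argument constructor):

    import java.io.StringReader;

    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class StandardTokenizerDemo {
        public static void main(String[] args) throws Exception {
            // Tokenize a sample sentence with the stock StandardTokenizer
            try (StandardTokenizer tokenizer = new StandardTokenizer()) {
                tokenizer.setReader(new StringReader("Welcome, Dr. Chasuble!"));
                CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
                tokenizer.reset();
                while (tokenizer.incrementToken()) {
                    // punctuation never shows up as a token of its own
                    System.out.println(term.toString());
                }
                tokenizer.end();
            }
        }
    }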

Upvotes: 1

Views: 1119

Answers (2)

yiping

Reputation: 96

You could consider using a tokenization tool from the NLP community instead; such issues are usually well taken care of there.

Some off-the-shelf tools are Stanford CoreNLP (it has individual components for tokenization as well). UIUC's pipeline should also handle it elegantly: http://cogcomp.cs.illinois.edu/page/software/
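
For example, a rough sketch with Stanford CoreNLP's PTBTokenizer (class names from memory, so double-check against the CoreNLP version you use):

    import java.io.StringReader;

    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.process.CoreLabelTokenFactory;
    import edu.stanford.nlp.process.PTBTokenizer;

    public class CoreNlpTokenizeDemo {
        public static void main(String[] args) {
            PTBTokenizer<CoreLabel> tokenizer = new PTBTokenizer<>(
                    new StringReader("Welcome, Dr. Chasuble!"),
                    new CoreLabelTokenFactory(),
                    "");  // default tokenizer options
            while (tokenizer.hasNext()) {
                // punctuation comes out as separate tokens (",", "!"),
                // while abbreviations like "Dr." keep their period
                System.out.println(tokenizer.next().word());
            }
        }
    }

PTB-style tokenization emits punctuation as separate tokens and keeps common abbreviations such as "Dr." intact, which matches the desired output above.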

Upvotes: 2

Alex Nevidomsky

Reputation: 708

Generally, for custom tokenization of both IR and non-IR content it is a good idea to use ICU (ICU4J is the Java version). This would be a good place to start: http://userguide.icu-project.org/boundaryanalysis

The tricky part is preserving the period as part of "Dr.". You would have to use the dictionary-based iterator or, optionally, implement your own heuristic, either in your code or by creating your own iterator, which in ICU can be defined as a file with a number of regexp-style rules.
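
A rough sketch of the boundary analysis with ICU4J's BreakIterator (default word rules, so it illustrates the "Dr." problem rather than solving it):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;

    import com.ibm.icu.text.BreakIterator;

    public class IcuBoundaryDemo {
        public static void main(String[] args) {
            String text = "Welcome, Dr. Chasuble!";
            BreakIterator words = BreakIterator.getWordInstance(Locale.US);
            words.setText(text);

            List<String> tokens = new ArrayList<>();
            int start = words.first();
            for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
                String piece = text.substring(start, end).trim();
                if (!piece.isEmpty()) {   // skip whitespace-only segments, keep punctuation
                    tokens.add(piece);
                }
            }
            // With the default rules this prints [Welcome, ,, Dr, ., Chasuble, !],
            // i.e. "Dr." is split; keeping it together needs a dictionary-based
            // iterator or custom break rules.
            System.out.println(tokens);
        }
    }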

Upvotes: 2
